Zip files and Encoding – I hate you.

I’ve written about some of the issues with depending on zip as a packaging format in the past. As people know, Web Apps is depending on Zip as the packaging format for Widgets.

Zip the good

Zip has a lot going for it. It is ubiquitous and dependable… so long as you don’t want to share files across cultures.

Zip the bad

The Zip spec does not seem to know that there are normalization forms for Unicode, when there are actually four (or more, because there are some non-standard ones too!). The Zip spec gives no guidance as to how file names inside zip files are to be normalized.

Consider: when a zip file is created on Linux, the tool just writes the bytes of the file name in the encoding of the underlying file system. So, if the file system uses ISO-8859-1, the bytes are written in ISO-8859-1. This may seem OK, but when you decompress the zip file on Windows, which runs on Windows-1252, the file names get all mangled. If the underlying encoding of the file system on Linux is something else, you won’t be able to share files with other systems at all. So in this case, it is not Windows’ fault.

The Zip spec says that the only supported encodings are CP437 and UTF-8, but everyone has ignored that. Implementers just encode file names however they want (usually byte for byte as they are in the OS… see table below).
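To make the two spec-sanctioned encodings concrete, here is a minimal sketch using Python's standard `zipfile` module. It relies on behaviour I believe CPython has: a non-ASCII name is written as UTF-8 with general purpose bit 11 (0x800) set, and a name without that bit is decoded as CP437 on read. The helper `zip_name_info` and the file names are made up for illustration.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: "file name is UTF-8"

def zip_name_info(name):
    """Write one empty entry called `name` to an in-memory zip, read it
    back, and return (decoded name, whether the UTF-8 flag was set)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return info.filename, bool(info.flag_bits & UTF8_FLAG)

print(zip_name_info("plain.txt"))
print(zip_name_info("ñ.txt"))
```

A tool that writes raw file-system bytes without setting that flag gives the reader nothing to go on, which is exactly the situation described above.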

It gets worse! Because Mac OS runs on some weird non-standard decomposed Unicode mode, you can only share zip files with other Mac OS users. According to this email, the LimeWire guys also ran into a similar problem with regard to encodings in Mac OS:

“for example a French, German or Spanish Windows user cannot exchange files that contain [file names with] French, German or Spanish accents with a French, German or Spanish Macintosh users”

The following table illustrates the problem:

Bytes that represent ñ in a Zip file (in hex)

File name | Zip in Windows               | Zip in Linux      | Zip in Mac OS
ñ         | A4 (Extended US-ASCII/CP437) | C3 B1 (UTF-8 NFC) | 6E CC 83 (UTF-8 NFD)

Yes! Holy crap! Three different byte sequences, corresponding to different character encodings and normalization forms.
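The three byte sequences can be reproduced with a few lines of Python. This is only a sketch: it uses the standard `unicodedata` module and Python's built-in `cp437` codec, and the `hex_bytes` helper is my own name, not part of any library.

```python
import unicodedata

name = "\u00f1"  # ñ as a single precomposed code point

def hex_bytes(data):
    """Render bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

cp437 = hex_bytes(name.encode("cp437"))
utf8_nfc = hex_bytes(unicodedata.normalize("NFC", name).encode("utf-8"))
utf8_nfd = hex_bytes(unicodedata.normalize("NFD", name).encode("utf-8"))

print(cp437, "|", utf8_nfc, "|", utf8_nfd)  # A4 | C3 B1 | 6E CC 83

# What a reader sees when UTF-8 bytes are decoded as CP437:
mangled = name.encode("utf-8").decode("cp437")
print(mangled)
```

The last two lines show the mangling itself: one character becomes two unrelated ones as soon as writer and reader disagree on the encoding.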

The only way around this would be a *special* custom-built widget zipping tool that normalizes file name strings to NFC. If the widget engine needs to decompress the widget to disk, then it would take the NFC names and convert them to the operating system’s native encoding (or store the files in memory, and reference them that way). This affects the URI scheme and DOM normalization of Widgets, so Web Apps will have to deal with it eventually… but I’m not sure exactly how.
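As a rough idea of what such a tool might do, here is a small Python sketch that copies a zip archive and rewrites every entry name in NFC. It is an illustration under assumptions, not a finished packager: `repack_nfc` is a hypothetical name, it assumes the source names were decoded correctly in the first place, and it drops per-entry metadata such as timestamps.

```python
import io
import unicodedata
import zipfile

def repack_nfc(src):
    """Return a new zip archive (as bytes) whose entry names are NFC.

    `src` is the bytes of an existing zip archive. Assumes the names in
    `src` decode correctly; two names that differ only in normalization
    would collide and are not handled here.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as zin, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            nfc_name = unicodedata.normalize("NFC", info.filename)
            zout.writestr(nfc_name, zin.read(info))
    return out.getvalue()
```

Feeding it an archive made on Mac OS (decomposed names) should give back one whose names match what a Linux or Windows user would type.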

Widget spec is now Widget Specs

In an effort to expedite the standardization of widgets, the Web Application Formats Working Group yesterday decided to split the Widgets 1.0 Specification into three (or more) specs:

Other specs may also follow, particularly:

Other documents are still under development too:

We are aiming to have all these done (i.e. Last Call) by October. However, now that the document split has happened, I should be able to get the packaging format done fairly quickly.

We have more or less now settled on the configuration language format. The elements are going to be:

  • <widget width="" height="" id="">
    • <title> the title/name of a widget
    • <description> a description
    • <author email="" url=""> some details about the author
    • <license> paste your GPL here! 🙂
    • <icon src=""> the icon
    • <access network="true|false" plugins="true|false"> if your widget needs to get online
    • <content src=""> some file in the widget archive

Only <widget> and <content> are mandatory at this point.
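A check for the two mandatory pieces could look something like the Python sketch below. This is my own illustration, not anything from the spec: `missing_mandatory` is a hypothetical helper, it only reports what is absent (it does not decide whether that is an error), and it assumes the elements live in the `http://www.w3.org/ns/widgets` namespace.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/ns/widgets}"  # assumed widget namespace

def missing_mandatory(config_xml):
    """Return the names of mandatory elements absent from a config document."""
    root = ET.fromstring(config_xml)
    missing = []
    if root.tag != NS + "widget":
        missing.append("widget")
    if root.find(NS + "content") is None:
        missing.append("content")
    return missing

config = ('<widget xmlns="http://www.w3.org/ns/widgets" '
          'width="200" height="100" id="example">'
          '<content src="index.html"/></widget>')
print(missing_mandatory(config))  # []
```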

The processing model for the XML is going to be quite forgiving. The only thing that will cause an error is a document that is not well-formed. For example, the following would result in “The Awesome Super Dude Widget” as the title:

<widget xmlns="http://www.w3.org/ns/widgets">
   <title>
     The <blink>Awesome</blink> 
     <author email="dude@example.com">Super Dude</author> Widget</title>
</widget>

The unrecognized elements are simply ignored, but their text content is extracted. This makes processing more forgiving and allows for extensibility and some graceful degradation. I also want to push that the widget should function if the namespace is omitted.
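The extraction rule can be sketched in a few lines of Python with the standard `xml.etree.ElementTree` module. Treat this as one possible reading, not the spec's algorithm: `widget_title` is a hypothetical helper, and collapsing runs of whitespace into single spaces is my assumption about how the title gets tidied up.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/ns/widgets}"

CONFIG = """<widget xmlns="http://www.w3.org/ns/widgets">
   <title>
     The <blink>Awesome</blink>
     <author email="dude@example.com">Super Dude</author> Widget</title>
</widget>"""

def widget_title(config_xml):
    """Return the title text, ignoring any markup nested inside <title>."""
    # fromstring raises ET.ParseError if the document is not well-formed,
    # which is the only hard error in this model.
    root = ET.fromstring(config_xml)
    title = root.find(NS + "title")
    if title is None:
        return ""
    # itertext() yields the text of the element and all its descendants;
    # split/join collapses whitespace (an assumption, see above).
    return " ".join("".join(title.itertext()).split())

print(widget_title(CONFIG))  # The Awesome Super Dude Widget
```

Note that this sketch requires the namespace to be present; supporting a namespace-less document would need a second lookup for a plain `title` element.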

We are also currently investigating how we are going to deal with internationalization in the configuration document format. We are looking at following ideas from the Best Practices for XML Internationalization.

WAF and WebAPI are dead. Long Live WebApps Working Group!

The charters of both the W3C Web Application Formats and WebAPI Working Groups have now expired (as of the 15th of November, 2007), meaning they are effectively dead (although still twitching!). From their ashes will rise a new merged working group called the Web Applications Working Group… hopefully by the 31st of January.

According to the new proposed charter, the mission of the new working group:

…is to provide specifications that enable improved client-side application development on the Web, including specifications both for application programming interfaces (APIs) for client-side development and for markup vocabularies for describing and controlling client-side application behavior.

The new Web Applications Working Group is chartered with the continued development of the following specifications:

Specification          | FPWD    | LC      | CR      | PR      | Rec
ClipOps spec           | 2007-Q2 | 2008-Q4 | 2009-Q2 | 2009-Q4 | 2010
DOM 3 Core bis spec    |         |         |         |         |
DOM 3 Events spec      | 2007-Q2 | 2008-Q2 | 2008-Q4 | 2009-Q4 | 2010
Element Traversal spec | 2007-Q2 | 2007-Q4 | 2008-Q2 | 2008-Q4 | 2008
Access Control spec    | 2006-Q2 | 2008-Q1 | 2008-Q3 | 2009-Q4 | 2010
File Upload spec       | 2007-Q2 | 2008-Q2 | 2008-Q4 | 2009-Q4 | 2010
Language Bindings spec | 2007-Q2 | 2008-Q2 | 2008-Q4 | 2009-Q4 | 2010
MAXIM spec             | 2008-Q1 | 2008-Q3 | 2008-Q4 | 2009-Q2 | 2009
Network API spec       | 2008-Q2 | 2009-Q1 | 2009-Q3 | 2010-Q2 | 2010
Progress Events spec   | 2007-Q2 | 2008-Q2 | 2008-Q3 | 2009-Q2 | 2009
Selectors API spec     | 2007-Q2 | 2007-Q4 | 2008-Q2 | 2008-Q4 | 2008
XHR Object spec        | 2007-Q2 | 2008-Q2 | 2008-Q4 | 2009-Q4 | 2010
Widgets spec           | 2006-Q4 | 2008-Q4 | 2009-Q1 | 2009-Q3 | 2009-Q4
Widgets Requirements   | 2006-Q3 | 2008-Q4 | 2009-Q1 | 2009-Q3 | 2009-Q4
Window Object spec     | 2007-Q2 | 2008-Q2 | 2008-Q4 | 2009-Q4 | 2010
XBL2 spec              | 2006-Q2 | 2010    | 2011    | 2013    | 2013
XBL2 Primer            | 2007-Q3 | 2010    | 2011    | 2013    | 2013

Another cool thing about the new working group is that it is modeled on the HTML Working Group, meaning that it is open and transparent (no secret chats on the members list), and anyone will be able to participate via the public mailing list.

I’ll continue to edit the Widget Spec and Requirements, and possibly continue to help out with the XBL Primer. I’ll continue to be part of this new working group for at least 1 year, as my PhD program ends in March 2009… and hopefully longer, if someone gives me a job to continue working on specs! 😉