Zip files and Encoding – I hate you.

I’ve written about some of the issues with depending on zip as a packaging format in the past. As people know, Web Apps is depending on Zip as the packaging format for Widgets.

Zip the good

Zip has a lot going for it. It is ubiquitous and dependable… so long as you don’t want to share files across cultures.

Zip the bad

The Zip spec does not seem to know that Unicode has normalization forms at all, when there are actually four standard ones: NFC, NFD, NFKC and NFKD (or more, because there are some non-standard ones too!). The Zip spec gives no guidance as to how file names inside zip files are to be normalized.
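To make the normalization problem concrete, here is a minimal Python sketch using the standard `unicodedata` module. It assumes the four standard Unicode normalization forms are the ones that matter here; the helper name `code_points` is made up for illustration and is not part of any zip tool.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_points(name: str) -> dict[str, list[str]]:
    """Return the code points of `name` under each standard normalization form."""
    return {
        form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, name)]
        for form in FORMS
    }

composed = "\u00f1"      # ñ as a single precomposed code point
decomposed = "n\u0303"   # n followed by COMBINING TILDE

# The two strings render the same on screen but are different file names
# as far as a byte-for-byte comparison is concerned.
print(composed == decomposed)                                # False
# Normalizing both sides to one form makes them comparable again.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(code_points(composed))
```

For ñ the compatibility forms (NFKC, NFKD) give the same result as NFC and NFD respectively; they only differ for characters such as ligatures.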

Consider: when a zip file is created on Linux, the tool just writes the bytes for the file name in the encoding of the underlying file system. So, if the file system is in ISO-8859-1, the bytes are written in ISO-8859-1. This may seem OK, but when you decompress the zip file on Windows, which runs on Windows-1252, the file names can come out mangled. If the underlying encoding of the file system on Linux is something else, you won’t be able to share files with other systems at all. So in this case, it is not Windows’ fault.
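The mangling is easy to reproduce without any zip tool at all. The sketch below only simulates the mismatch: the function name `reinterpret` and the particular encoding pairs are my own picks for illustration, not a claim about what any specific zip program does.

```python
def reinterpret(name: str, written_as: str, read_as: str) -> str:
    """Encode `name` the way the zipping system might, then decode the same
    bytes the way the unzipping system might. The bytes never change; only
    the guess about their encoding does."""
    return name.encode(written_as).decode(read_as, errors="replace")

n_tilde = "\u00f1"  # ñ

# Written as ISO-8859-1 (byte F1), read back as CP437:
print(reinterpret(n_tilde, "iso-8859-1", "cp437"))   # ±
# Written as UTF-8 (bytes C3 B1), read back as Windows-1252:
print(reinterpret(n_tilde, "utf-8", "cp1252"))       # Ã±
# Written as CP437 (byte A4), read back as ISO-8859-1:
print(reinterpret(n_tilde, "cp437", "iso-8859-1"))   # ¤
```

Nothing in the archive records which encoding was used on the writing side, so the reading side can only guess.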

The Zip spec says that the only supported encodings are CP437 and UTF-8, but everyone has ignored that. Implementers just encode file names however they want (usually byte for byte as they are in the OS… see table below).
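For what it’s worth, the spec does give tools a way to say “this name is UTF-8”: general purpose bit 11 (0x800) in the entry header. Here is a small sketch, using Python’s standard `zipfile` module, that checks whether that bit gets set. The helper name `utf8_flag_set` is hypothetical, and the behaviour shown is what I understand Python’s own writer to do, not what other zip tools do.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: file name is UTF-8

def utf8_flag_set(filename: str) -> bool:
    """Write a one-entry archive in memory with Python's zipfile, read it
    back, and report whether the UTF-8 file name flag is set on the entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(filename, b"")
    with zipfile.ZipFile(buf) as zf:
        return bool(zf.infolist()[0].flag_bits & UTF8_NAME_FLAG)

print(utf8_flag_set("plain.txt"))    # ASCII-only name
print(utf8_flag_set("\u00f1.txt"))   # name containing ñ
```

A reader that honours the flag can decode names reliably; one that ignores it is back to guessing.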

It gets worse! Because Mac OS runs on some weird non-standard decomposed Unicode form, you can only share zip files with other Mac OS users. According to this email, the LimeWire guys also ran into a similar problem with regard to encodings in Mac OS:

“for example a French, German or Spanish Windows user cannot exchange files that contain [file names with] French, German or Spanish accents with a French, German or Spanish Macintosh users”

The following table illustrates the problem:

Bytes that represent ñ in a Zip file (in hex)

File name   Zip in Windows                 Zip in Linux        Zip in Mac OS
ñ           A4 (Extended US-ASCII/CP437)   C3 B1 (UTF-8 NFC)   6E CC 83 (UTF-8 NFD)

Yes! Holy crap! Three different byte sequences, corresponding to different character encodings and normalization forms.
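If you want to check the three byte sequences yourself, this short Python sketch reproduces them. The mapping of each operating system to an encoding and normalization form is taken straight from the table; `hex_bytes` is just an illustrative helper.

```python
import unicodedata

def hex_bytes(text: str, form: str, encoding: str) -> str:
    """Normalize `text` to `form`, encode it, and format the bytes as hex."""
    normalized = unicodedata.normalize(form, text)
    return " ".join(f"{b:02X}" for b in normalized.encode(encoding))

n_tilde = "\u00f1"  # ñ

print(hex_bytes(n_tilde, "NFC", "cp437"))   # Windows column: A4
print(hex_bytes(n_tilde, "NFC", "utf-8"))   # Linux column:   C3 B1
print(hex_bytes(n_tilde, "NFD", "utf-8"))   # Mac OS column:  6E CC 83
```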

The only way around this would be a *special* custom-built widget zipping tool that normalizes file name strings to NFC. If the widget engine needs to decompress the widget to disk, it would take the NFC names and convert them to the operating system’s native encoding (or store the files in memory, and reference them that way). This affects the URI scheme and DOM normalization of Widgets, so Web Apps will have to deal with it eventually… but I’m not sure exactly how.
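As a rough idea of what such a tool could look like, here is a minimal sketch in Python. It is not a proposal for the Widgets spec: the function names `zip_with_nfc_names` and `read_entry` are hypothetical, the archive is built in memory for simplicity, and it leans on Python’s `zipfile` to store non-ASCII names as UTF-8.

```python
import io
import unicodedata
import zipfile

def zip_with_nfc_names(files: dict[str, bytes]) -> bytes:
    """Build a zip archive in memory, normalizing every entry name to NFC."""
    buf = io.BytesIO()
    seen: set[str] = set()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            nfc_name = unicodedata.normalize("NFC", name)
            # Two names that differ only in normalization would collide.
            if nfc_name in seen:
                raise ValueError(f"duplicate name after normalization: {nfc_name!r}")
            seen.add(nfc_name)
            zf.writestr(nfc_name, data)
    return buf.getvalue()

def read_entry(archive: bytes, name: str) -> bytes:
    """Look up an entry without unpacking to disk, normalizing the
    requested name to NFC before the lookup."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(unicodedata.normalize("NFC", name))

# A name in decomposed form (as Mac OS would hand it over) goes in...
archive = zip_with_nfc_names({"n\u0303.txt": b"hola"})
# ...and can be found again under either spelling.
print(read_entry(archive, "\u00f1.txt"))
print(read_entry(archive, "n\u0303.txt"))
```

Reading entries straight from memory sidesteps the question of what the local file system does to the names; writing them to disk would still need a conversion step on each platform.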

IE8 XDomainRequest conspiracy theory

UPDATE: This conspiracy theory has been debunked. Microsoft said they would implement various aspects of the access-control spec in IE8. For what it’s worth, those Microsoft guys are ok with me 🙂

I love conspiracy theories… particularly when I get to make one up! Here is my conspiracy theory for how Microsoft will try to force both the W3C and other browser makers to adopt IE8’s XDomainRequest mechanism/API.

A bit of background first: the Web Application Formats Working Group (WAF) has been working on a spec that allows browsers to do cross-domain requests (basically for creating mashups securely). The spec is called Access-Control, and has been in development for three years. The spec was being edited by Anne van Kesteren of Opera Software, but under heavy influence from Hixie of Google, Jonas Sicking from Mozilla, and Maciej Stachowiak from Apple, to name a few people/companies. Marc Silbey, the representative from Microsoft to the working group, was also participating for a while, but he dropped off the radar as Microsoft shifted into high gear during development of IE8 (actually, Microsoft assigned 3 people to participate in WAF, but only Marc did). A few months ago, to coincide with the release of the IE8 beta, Microsoft announced XDomainRequest… aspects of which look, in a lot of ways, very similar to Access-Control, but with some key differences. Then, to the shock of the working group, they brought XDomainRequest to the W3C for standardization knowing full well that WAF had been working on Access-Control for over three years!

Naturally, Microsoft’s actions pissed a lot of people off because, as I stated in an email, they are just ignoring over three years of work on the Access-Control spec: they created their own proposal and implementation in secret, and are now attempting to fast-track it through standardization, ignoring due process.

To which Sunava Dutta, from Microsoft, responded by saying “incorrect”, prompting Chris Wilson, Chief Architect of IE, to respond:

You know, there is an idea that perhaps we’re not IGNORING the work on Access Control, and perhaps we simply disagree with some of it.

Which prompted me to respond:

…If Microsoft would have found the time to collaborate [in the WAF WG], all this stuff could have been resolved progressively and the [Access-Control] spec would probably be done by now (as has been shown, the MS proposal has just as many issues, if not more, than the Access-control spec; so trying to do it in-house did not yield a more adequate solution).

Which begs the question: why did Microsoft stop participating in WAF to go off and create their own version of Access-Control? And here is the conspiracy theory:

  1. Microsoft joins the WAF working group in 2007
  2. Microsoft “borrows” Access-Control idea
  3. Microsoft implements its own XDomainRequest mechanism in the IE8 beta
  4. Mozilla implements Access-Control in Firefox 3, but then pulls the feature at the last minute (consequently leaving a gap in the cross-domain request space for Microsoft to jump in)
  5. Microsoft delays Access-Control work by sending in comments a year late (just before it was about to go to Last Call) and putting in their XDomainRequest proposal for standardization. Meanwhile…
  6. Microsoft rolls out IE8, quickly gains market share (no help from Vista, of course 🙂 )
  7. Other browsers must now implement Microsoft’s solution/spec because businesses and developers start using it
  8. Microsoft’s spec becomes a W3C Recommendation; the Access-Control spec dies in the ass.

We are currently at point 5, with Microsoft using delay tactics to slow down standardization of Access-Control.

Why do I care? I’ve only contributed to Access-Control from the sidelines by attending face-to-face meetings and asking Anne dumb questions. However, a lot of CO2 has been wasted flying everyone to meetings to talk about this spec; that’s thousands of dollars and thousands of kilos of CO2 going to waste. Another thing that annoys me is, as I already stated, that Microsoft has had every chance to provide feedback to the working group to fix/discuss any issues they’ve had with the Access-Control spec.