Zip files and Encoding – I hate you.

I’ve written about some of the issues with depending on Zip as a packaging format in the past. As people know, Web Apps is depending on Zip as the packaging format for Widgets.

Zip the good

Zip has a lot going for it. It is ubiquitous and dependable… so long as you don’t want to share files across cultures.

Zip the bad

The Zip spec does not seem to know that there are normalization forms for Unicode text, of which there are actually four standard ones (NFC, NFD, NFKC, NFKD — or more, because there are some non-standard ones too!). The Zip spec gives no guidance as to how file names inside zip files are to be normalized.

Consider what happens when a zip file is created on Linux: the tool just writes the bytes of the file name in the encoding of the underlying file system. So, if the file system is in ISO-8859-1, the bytes are written in ISO-8859-1. This may seem OK, but when you decompress the zip file on Windows, which runs on the Windows-1252 encoding, the file names get all mangled. If the underlying encoding of the file system on Linux is something else, you won’t be able to share files with other systems at all. So in this case, it is not Windows’s fault.
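A quick sketch of that mangling in Python, assuming the unzipping tool on Windows interprets the raw name bytes with the DOS OEM code page (CP437):

```python
# A file named "año" zipped on a Latin-1 system stores raw ISO-8859-1 bytes.
raw = "año".encode("iso-8859-1")  # b'a\xf1o'

# A tool that assumes the OEM code page (CP437) misreads those bytes:
mangled = raw.decode("cp437")
print(mangled)  # 'a±o' — the ñ (byte 0xF1) becomes ± in CP437
```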

The Zip spec says that the only supported encodings are CP437 and UTF-8, but everyone has ignored that. Implementers just encode file names however they want (usually byte for byte as they appear in the OS… see the table below).

It gets worse! Because Mac OS runs in some weird non-standard decomposed Unicode mode, you can only share zip files with other Mac OS users. According to this email, the LimeWire guys also ran into a similar problem with regard to encodings in Mac OS:

“for example a French, German or Spanish Windows user cannot exchange files that contain [file names with] French, German or Spanish accents with a French, German or Spanish Macintosh users”

The following table illustrates the problem:

Bytes that represent ñ in a Zip file (in hex):

File name | Zip in Windows                | Zip in Linux      | Zip in Mac OS
ñ         | A4 (Extended US-ASCII/CP437)  | C3 B1 (UTF-8 NFC) | 6E CC 83 (UTF-8 NFD)

Yes, holy crap! Three different byte sequences, corresponding to three different character encodings.
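The table can be reproduced with nothing but the Python standard library:

```python
import unicodedata

ch = "ñ"
print(ch.encode("cp437").hex())                                # a4
print(ch.encode("utf-8").hex())                                # c3b1
print(unicodedata.normalize("NFD", ch).encode("utf-8").hex())  # 6ecc83
```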

The only way around this would be a *special* custom-built widget zipping tool that normalizes file name strings to NFC. If the widget engine needs to decompress the widget to disk, then it would take the NFC names and convert them to the operating system’s native encoding (or store the files in memory and reference them that way). This affects the URI scheme and DOM normalization of Widgets, so Web Apps will have to deal with it eventually… but I’m not sure exactly how.
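A minimal sketch of such a zipping tool in Python; `write_widget` is a hypothetical helper, not part of any widget spec, and it assumes names can simply be rewritten as the archive is packed:

```python
import unicodedata
import zipfile

def write_widget(dest, files):
    # files maps file names to byte contents; every name is forced to NFC
    # before being stored, so all platforms see one canonical byte sequence.
    with zipfile.ZipFile(dest, "w") as zf:
        for name, data in files.items():
            zf.writestr(unicodedata.normalize("NFC", name), data)
```

Python’s zipfile module stores non-ASCII names as UTF-8 (with the archive’s UTF-8 flag set), so the bytes on disk end up as UTF-8 NFC regardless of what form the caller passed in.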

14 thoughts on “Zip files and Encoding – I hate you.”

  1. I haven’t had much success with tar.bz2 either.

    There’s The Unarchiver for Mac OS X, which tries to guess the encoding of filenames.

    Since UTF-8 can be mostly-reliably distinguished from 8-bit encodings, I think it should be required for all decompressors.

    And NFD is Mac OS X’s problem, not ZIP’s. If some app tries to use bytes in filenames that the system simply does not allow by definition, then that’s a bug in the app, and the app should be fixed.

    I think going forward, all ZIP-dependent specs should require filenames in UTF-8 and forbid applications from relying on any particular Unicode normalization.
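The commenter’s guess-then-fallback idea can be sketched in Python; the CP437 fallback is an assumption here, and a real tool might prefer the local code page instead:

```python
def decode_zip_name(raw: bytes) -> str:
    # Strict UTF-8 first: random 8-bit data is very unlikely to be valid
    # UTF-8, so a successful decode almost certainly means it really is UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback is a guess; CP437 is what the zip spec nominally mandates.
        return raw.decode("cp437")
```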

  2. There’s nothing “weird” or “nonstandard” about the OS X Unicode decomposition; it’s just plain NFD as far as I know. Now, if Unicode decomposition were the only problem, this would all be trivial.

    But the real problem is the already-mentioned ISO-8859-1, Windows-1252, CP437, and the as-yet-unmentioned Shift_JIS, EUC-KR, Big5, ISO 8859 parts 2 through 15 or however many there are, and so on, and so on.

    Really, the only way to reliably open a zip file is to either ask the user for the character encoding (and he probably doesn’t know), or to try and autodetect it.

    I’ve had some success using Mozilla’s universalchardet to open Zip files in http://code.google.com/p/theunarchiver/. A friend is currently also helping getting some of the core code to run on Linux. It’s all Objective-C, though, which will probably scare people off from using it.

  3. Also, tar is a very inflexible and limited format, and not much good for any platform with filesystem metadata of any kind, which is pretty much all of them these days.

  4. Future zip executables (compressors) should assume filenames are in the system encoding (a very reasonable assumption in my opinion) and convert them to UTF-8 in the created zip files.

  5. Blame all the programmers who think that encoding doesn’t matter and refuse to get on the UTF-8 bandwagon even though the rest of the world has long since been on the bus.

  6. Windows-1252 is actually a superset of ISO 8859-1 (disregarding control characters which will not appear in file names anyway), so your example is technically incorrect: encoding as ISO 8859-1 and decoding as Windows-1252 would work perfectly fine.

    You later mention CP437 as an encoding used on Windows machines, but also say that “everyone” ignores the specification, which says that only CP437 or UTF-8 should be used. (CP437 is effectively incompatible with ISO 8859-1 and Windows-1252.) I am confused as to what encoding Windows actually uses/assumes. Do some versions use Windows-1252 and others CP437? Please clarify. (Obviously, other encodings must be used for non-Western demographics, as touched upon by another commentator, but let us leave that for now.)
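The superset claim above is easy to check in Python: outside the 0x80–0x9F control range, the two encodings decode every byte identically.

```python
# Every byte outside the 0x80-0x9F control range decodes to the same
# character under ISO 8859-1 and Windows-1252.
ok = all(
    bytes([b]).decode("iso-8859-1") == bytes([b]).decode("cp1252")
    for b in range(0x100)
    if not 0x80 <= b <= 0x9F
)
print(ok)  # True
```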

  7. The unzip command has -O and -I options to specify source filename encodings.

    If the archive was created on Windows, use the -O option, something like:

    unzip -O sjis yourarchive.zip

    The -I option is used if you archived it on Linux/Unix with a different encoding.

  8. I used this simple UTF-8 normalizing sub, which solved all issues with all the archive types:

    Here’s the example:
    #!/usr/bin/perl

    use strict;
    use warnings;
    use Encode qw(decode);
    use Encode::Detect; # registers the "Detect" pseudo-encoding

    sub normalize_to_utf8 {
        # guess the byte string's encoding and decode it to a Perl Unicode string
        return decode("Detect", shift);
    }

    my $fname = "any_filename_from_archive_as_it_comes";
    $fname = normalize_to_utf8($fname);

    # now $fname's encoding has been detected and converted,
    # ready to be used

    Hope this helps.

  9. O. Andersen: FYI, the difference between CP437 and CP1252 is that CP437 is an OEM code page and CP1252 is the ANSI code page. Yes, there are two code pages in use in Windows. Most GUI stuff uses the ANSI one; most console stuff and other stuff that comes from DOS uses the OEM one.

  10. I have just faced a similar problem with file name encoding on a Mac. My friend sent me some files, and after decompression the file names looked weird. My solution was to use another program which supports unzipping with another encoding, instead of the unzip program which comes with Mac OS.
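One such repair can be sketched in Python: the standard zipfile module decodes entry names as CP437 whenever the archive’s UTF-8 flag (bit 11) is unset, so a name that was really UTF-8 can often be recovered by round-tripping the bytes. `fixed_names` is a hypothetical helper, not a general fix:

```python
import zipfile

def fixed_names(zip_source):
    # Undo CP437 mojibake on entries whose names were really UTF-8.
    # CP437 maps all 256 byte values, so encoding back to CP437 recovers
    # the original raw bytes losslessly before retrying them as UTF-8.
    names = []
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            name = info.filename
            if not info.flag_bits & 0x800:  # UTF-8 flag not set
                try:
                    name = name.encode("cp437").decode("utf-8")
                except UnicodeError:
                    pass  # genuinely CP437 (or something else); keep as-is
            names.append(name)
    return names
```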

Comments are closed.