Firstly, as far as I can tell, ZipLib does everything right with regard to character encoding in filenames and comments. The PKZip specification is clear that the encoding should be IBM Code Page 437 unless the Language Encoding Flag (called EFS, for some reason), also known as bit 11 of the general purpose bit flag, is set, in which case it should be UTF-8. There are other relevant extensions to the specification: one for duplicate UTF-8 names and one that allows you to specify the encoding in an extra field, but I have not come across examples in the wild.
Files compressed with Mac OS X use UTF-8 for all filenames, but do not set the EFS bit, which makes them non-compliant and ambiguous. I'm sure they are not the only ones who play fast and loose with the standard, but they are the ones that brought me here.
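To make the ambiguity concrete, here is a small sketch that decodes the same raw filename bytes both ways. The class and method names are my own for illustration and are not part of ZipLib. It assumes a runtime where CodePagesEncodingProvider is available (modern .NET, or the System.Text.Encoding.CodePages package on older frameworks).

```csharp
using System;
using System.Text;

static class FilenameAmbiguityDemo
{
    // Decodes raw filename bytes as IBM Code Page 437, the ZIP default
    // when bit 11 is not set.
    // Assumption: CodePagesEncodingProvider is available on this runtime.
    public static string DecodeAsCp437(byte[] raw)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437).GetString(raw);
    }

    // Decodes the same bytes as UTF-8, which is what the writer intended
    // in the Mac OS X case.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }
}
```

For the UTF-8 bytes of "café.txt" (63 61 66 C3 A9 2E 74 78 74), DecodeAsUtf8 gives back "café.txt", while DecodeAsCp437 gives "caf├⌐.txt", because C3 and A9 are box-drawing and symbol characters in Code Page 437. Nothing in the bytes themselves says which reading is correct when the flag is missing.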
The best we can do is to guess when filenames use UTF-8 based on heuristic analysis. I used Utf8Checker and modified the helper function in ZipConstants thusly:
public static string ConvertToStringExt(int flags, byte[] data)
{
    if (data == null) {
        return string.Empty;
    }

    // Compliant ZIP files set this flag if the filename uses UTF-8
    var useUtf8 = (flags & (int)GeneralBitFlags.UnicodeText) != 0;

    // Some programs, such as OS X zip, don't set the flag but use UTF-8 anyway,
    // so a best guess must be made of the filename encoding
    if (!useUtf8) {
        useUtf8 = Unicode.Utf8Checker.IsUtf8(data, data.Length);
    }

    // Is the filename UTF-8 encoded?
    if (useUtf8) {
        return Encoding.UTF8.GetString(data, 0, data.Length);
    }
    else {
        // Fall back to the library's default (Code Page 437) conversion
        return ConvertToString(data, data.Length);
    }
}
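If you would rather not pull in Utf8Checker, a similar guess can probably be made with the framework's own strict decoder. The following is only a sketch of that alternative, not what Utf8Checker does internally, and the StrictUtf8 class and TryDecode method are hypothetical names of mine. It relies on UTF8Encoding being constructed with throwOnInvalidBytes set to true, so malformed input raises an exception instead of being replaced with U+FFFD.

```csharp
using System;
using System.Text;

static class StrictUtf8
{
    // throwOnInvalidBytes: true makes GetString throw on malformed
    // sequences instead of silently substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true);

    // Returns true and the decoded text if data is well-formed UTF-8;
    // returns false (and null text) otherwise.
    public static bool TryDecode(byte[] data, out string text)
    {
        text = null;
        if (data == null)
        {
            return false;
        }

        try
        {
            text = Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: the caller should fall back to CP437.
            return false;
        }
    }
}
```

A name such as "café" stored as Code Page 437 contains the single byte 82 for "é", which is a lone continuation byte in UTF-8, so TryDecode rejects it and the caller can fall back to the CP437 path. Pure ASCII names pass the check, which is harmless because printable ASCII decodes the same way under both encodings. Like any heuristic, this can still misfire on a CP437 name whose bytes happen to form valid UTF-8.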
Of course there are other encodings, but there is no definitive way to deduce them, and outside of legacy files, one would expect any non-ANSI encoding to be UTF-8.
I've written this post mostly for myself to find in the future when I have the same problem again and have forgotten the solution.
References
- http://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or
- http://utf8checker.codeplex.com/
- http://www.pkware.com/documents/casestudies/APPNOTE.TXT