UTF-8 Support in ZipLib

Firstly, as far as I can tell, ZipLib does everything right with regard to character encoding in filenames and comments. The PKZip specification is clear that the encoding should be IBM Code Page 437 unless the Language Encoding Flag (EFS for some reason), also known as bit 11 of the general purpose bit flag, is set, in which case it should be UTF-8. There are other relevant extensions to the specification: one for duplicate UTF-8 names and one that allows you to specify the encoding in an extra field, but I have not come across examples in the wild.

Files compressed with Mac OS X use UTF-8 for all filenames but do not set the EFS bit, which makes them non-compliant and ambiguous. I'm sure it is not the only tool that plays fast and loose with the standard, but it is the one that brought me here.
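To make the ambiguity concrete, here is a minimal sketch that decodes the same raw filename bytes both ways. The class and method names are hypothetical and not part of ZipLib. It assumes a runtime where CodePagesEncodingProvider is available, since code page 437 has to be registered on .NET Core and later.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of ZipLib.
public static class FilenameAmbiguityDemo
{
    // Encodes a name as UTF-8 (as OS X does), then decodes the bytes as CP437,
    // which is what a strictly compliant reader does when bit 11 is not set.
    public static string DecodeUtf8BytesAsCp437(string name)
    {
        // Needed on .NET Core / .NET 5+ before code page 437 can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = Encoding.UTF8.GetBytes(name);
        return Encoding.GetEncoding(437).GetString(raw);
    }

    // Decodes the same bytes as UTF-8, which recovers the original name.
    public static string DecodeUtf8BytesAsUtf8(string name)
    {
        byte[] raw = Encoding.UTF8.GetBytes(name);
        return Encoding.UTF8.GetString(raw);
    }
}
```

For a name such as "café.txt", the UTF-8 decoding returns the original string, while the CP437 decoding yields one character per byte, so the accented letter comes back as two unrelated symbols.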

The best we can do is to guess when filenames use UTF-8 based on heuristic analysis. I used Utf8Checker and modified the helper function in ZipConstants thusly:

        public static string ConvertToStringExt(int flags, byte[] data)
        {
            if (data == null) {
                return string.Empty;
            }
            // Compliant ZIP files set this flag if the filename uses UTF-8
            var useUtf8 = (flags & (int)GeneralBitFlags.UnicodeText) != 0;
            // Some programs, such as OS X zip, don't set the flag but use UTF-8 anyway,
            // so a best guess must be made of the filename encoding
            if (!useUtf8) {
                useUtf8 = Unicode.Utf8Checker.IsUtf8(data, data.Length);
            }
            // Is the filename UTF-8 encoded?
            if (useUtf8) {
                return Encoding.UTF8.GetString(data, 0, data.Length);
            }
            else {
                // Fall back to the default code page conversion
                return ConvertToString(data, data.Length);
            }
        }
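
If you would rather not pull in Utf8Checker, a rough stand-in for the guess can be built from the framework's strict UTF-8 decoder. This is only a sketch under my own assumptions: Utf8Guess and LooksLikeUtf8 are hypothetical names, and the logic is not necessarily what Utf8Checker does internally. It reports UTF-8 only when the data contains at least one non-ASCII byte and the whole buffer is well-formed UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of ZipLib or Utf8Checker.
public static class Utf8Guess
{
    // Strict decoder: throws on malformed input instead of substituting U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static bool LooksLikeUtf8(byte[] data)
    {
        if (data == null || data.Length == 0) {
            return false;
        }

        // Pure ASCII gives no evidence either way, so report "no" and let
        // the caller use its default decoding.
        bool hasHighByte = false;
        foreach (byte b in data) {
            if (b >= 0x80) {
                hasHighByte = true;
                break;
            }
        }
        if (!hasHighByte) {
            return false;
        }

        try {
            // Validates the whole buffer without allocating a string.
            StrictUtf8.GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException) {
            return false;
        }
    }
}
```

The guess can still be wrong: a short CP437 name whose high bytes happen to form a valid UTF-8 sequence will be misread, although such byte patterns are uncommon in real filenames.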

Of course there are other encodings, but there is no definitive way to deduce them, and outside of legacy files, one would expect any non-ANSI encoding to be UTF-8.

I've written this post mostly for myself to find in the future when I have the same problem again and have forgotten the solution.

References

http://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or
http://utf8checker.codeplex.com/
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
