Re: How to detect the encoding of a string?
I'm certainly not aware of a method of automatically detecting which
8-bit character set was used. However, one solution might be to put a
conversion library into zip utilities that could optionally convert file
names between character sets. Just feeding the file names and nothing
else to libiconv could accomplish that.
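As a rough illustration of that idea, here is a minimal sketch in Python, using its built-in codecs as a stand-in for libiconv. The function name convert_name, the CP737 source and the UTF-8 target are assumptions made for the example only; the caller still has to supply the source character set, since it cannot be detected.

```python
def convert_name(raw: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode a file name from one character set to another.

    Hypothetical helper: only the name bytes are converted, nothing else.
    Raises UnicodeDecodeError / UnicodeEncodeError if the name does not
    fit the given character sets.
    """
    return raw.decode(source).encode(target)


# A Greek name stored in CP737, as an old DOS-era zip tool might write it.
stored = "λογος.txt".encode("cp737")

# Convert it to UTF-8 for a modern file system.
converted = convert_name(stored, "cp737")
print(converted.decode("utf-8"))
```

A zip utility could expose the source character set as a command-line option and apply such a conversion to each stored name on extraction.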
----- Original Message -----
From: Danilo Segan <dsegan@xxxxxxx>
Date: Thursday, June 2, 2005 4:07 pm
Subject: Re: How to detect the encoding of a string?
> Hi Simos,
>
> It's completely impossible to detect which of the 8-bit encodings is
> used without any further knowledge (for instance, of the language in
> use).
>
> To be able to actually decide on one of the many 8-bit encodings
> suitable for a language, one would also need to know language
> properties (such as the frequency of each letter in it), but it's still
> unlikely that it would work for strings as short as filenames.
>
> If you need a formal proof of "undetectability", here's one:
> - a valid ISO-8859-1 string is always a completely valid ISO-8859-2 (or
> -4, -5) string (they occupy exactly the same spots 0xa1-0xff),
> i.e. you can *never* determine if some character not present in
> another set is actually used.
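The argument above can be checked with a short sketch in Python, assuming its built-in codecs for these character sets. The names CANDIDATES, decodes_as and plausible_encodings are made up for the illustration. Every byte in 0xa1-0xff decodes without error under each candidate, so validity alone never rules one out.

```python
# Character sets named in the argument above (Python codec names).
CANDIDATES = ["iso-8859-1", "iso-8859-2", "iso-8859-4", "iso-8859-5"]


def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is a valid byte sequence in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def plausible_encodings(raw: bytes) -> list[str]:
    """List every candidate encoding under which raw decodes cleanly."""
    return [enc for enc in CANDIDATES if decodes_as(raw, enc)]


# All bytes from 0xa1 to 0xff: valid in every candidate character set.
high_bytes = bytes(range(0xA1, 0x100))
print(plausible_encodings(high_bytes))

# The same single byte is a different letter in each set.
for enc in CANDIDATES:
    print(enc, b"\xe8".decode(enc))
```

The byte 0xe8, for instance, is "è" in ISO-8859-1 and "č" in ISO-8859-2; both are ordinary letters, and nothing in the byte itself says which was meant.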
>
> Today at 20:16, Simos Xenitellis wrote:
>
> > P.S.
> > If you would like to experiment with your own ZIP application,
> > try
> > http://www.thranio.gr/sxolikes-giortes/telikes/omilies/apoxairetisthrio-logos-mathith.zip
> > The filename is encoded in CP737 (a la iconv). All open-source ZIP
> > tools (=unzip, file-roller, ark) fail to detect the encoding.
> > WinZip is able to detect the encoding.
>
> My guess is that WinZip is running on a Greek Windows, and that
> WinZip uses the old IBM encodings for i18n names there, assuming CP737
> on a Greek system.
>
> Can you confirm or dispute my assumption (e.g. by trying on a non-Greek
> Windows system, or just confirming that this was actually attempted on
> a non-Greek system)?
>
> Cheers,
> Danilo
>
> --
> Linux-UTF8: i18n of Linux on all levels
> Archive: http://mail.nl.linux.org/linux-utf8/