Re: Automatic encoding guessing
Followup to: <Pine.LNX.4.31.0110231808590.24829-100000@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
By author: Markus Kuhn <mgk25@xxxxxxxxx>
In newsgroup: linux.utf8
>
> Depending on the amount of effort, you can distinguish different
> encodings quite well as long as the text is long enough for the usual
> cryptanalytic techniques for breaking substitution ciphers to work
> (which means usually >500 characters):
>
> - UTF-8 follows strict rules and every other encoding (except for the
> UTF-8 subset ASCII, which usually need not be distinguished)
> will contain either malformed UTF-8 sequences (when it's an
> 8-bit encoding) or ISO 2022 sequences (when it's a CJK
> encoding), both of which make it pretty unlikely that a
> non-UTF-8 encoding is mistaken for a UTF-8 encoding.
>
I have had data corruption because of the above assumption (some
versions of Tcl seem to make it) -- there are legal ISO-8859-x
sequences which are also legal UTF-8 sequences.
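
To illustrate the collision, here is a minimal sketch in Python. The
function name looks_like_utf8 is hypothetical and only stands in for
the "it decodes cleanly, so it must be UTF-8" heuristic; it is not
what Tcl actually does internally.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic under discussion: call it UTF-8 if it decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ISO-8859-1 text consisting of U+00C3 (A with tilde) followed by
# U+00A9 (copyright sign) is stored as the two bytes C3 A9.
latin1_bytes = "\u00c3\u00a9".encode("iso-8859-1")

# Those same two bytes are also a well-formed UTF-8 sequence, for
# U+00E9 (e with acute), so the heuristic accepts them as UTF-8.
misread = latin1_bytes.decode("utf-8")

# Most ISO-8859-1 text is rejected, e.g. a lone E9 byte is malformed UTF-8.
typical_latin1 = "caf\u00e9".encode("iso-8859-1")
```

Here looks_like_utf8(latin1_bytes) is true even though the data was
written as ISO-8859-1, while looks_like_utf8(typical_latin1) is false.
The heuristic is right most of the time, which is exactly why the
remaining cases turn into silent corruption.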
> - EUC files similarly have characteristic byte sequences that are not
> allowed in these encodings, such as unpaired GR bytes.
>
> - ISO 8859 files should be free of C1 and most C0 codes (except
> for the usual LF/TAB).
I have also had Emacs 20 garble data because of the above assumption
:(
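
A similar sketch for the C1 rule, again in Python. The names
has_c1_bytes and plausible_iso8859 are hypothetical, and this is only
my reading of the quoted heuristic, not a description of what Emacs 20
does internally.

```python
def has_c1_bytes(data: bytes) -> bool:
    """True if any byte falls in the C1 control range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


def plausible_iso8859(data: bytes) -> bool:
    """Quoted heuristic: ISO 8859 data should contain no C1 codes."""
    return not has_c1_bytes(data)


# ISO-8859-1 data that really does contain a C1 control:
# U+0085 (NEL, next line) is encoded as the single byte 0x85.
sample = "line one\u0085line two".encode("iso-8859-1")

# The data round-trips through ISO-8859-1 without loss ...
roundtrip = sample.decode("iso-8859-1").encode("iso-8859-1")

# ... yet the heuristic refuses to believe it is ISO 8859.
accepted = plausible_iso8859(sample)
```

Here accepted is false although sample is perfectly good ISO-8859-1,
so a tool relying on this rule alone will pick some other encoding for
the file.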
Please, people; remember that heuristics are just that and can't be
blindly trusted :(
-hpa
--
<hpa@xxxxxxxxxxxxx> at work, <hpa@xxxxxxxxx> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <amsp@xxxxxxxxx>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/