Re: Automatic encoding guessing
Tue, 23 Oct 2001 18:40:27 +0100 (BST), Markus Kuhn <mgk25@xxxxxxxxx> wrote:
> - You can do a bit more with character and tuple frequency
> analysis. You need for various languages (English, German,
> French, C, Lisp) and their transliterations a library of
> frequency tables for the various UCS characters/pairs,
> and then you try all Something->UCS conversions
> until you find the best match of the resulting histogram
> with one in the library (read up on "index of coincidence"
> [Friedman, ~1920] in introductory cryptanalysis textbooks
> such as Stinson).
I've done this, using the frequencies of single letters only. It has
always worked in practice when I needed it.
The program at
<http://qrczak.ids.net.pl/programy/linux/konwert/konwert-1.8.tar.gz>
contains it (it's really old and rusty; I haven't had time to polish it).
Usage: e.g.
konwert any/pl-iso2
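Currently supported languages are cs de el eo es fr he it pl pt ru sv,
each in a couple of encodings. For Latin-based scripts it of course
uses only the frequencies of letters outside the English alphabet,
since the ASCII letters are encoded identically in ASCII-compatible
encodings and so say nothing about which one is in use.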
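
To illustrate the single-letter approach, here is a minimal sketch in
Python. It is not the code from konwert. The function names, the list
of candidate encodings and the tiny Polish reference sample are all
assumptions made for the example; a real table would be built from a
large corpus.

```python
from collections import Counter

# Candidate encodings to try (an assumption for this sketch).
CANDIDATES = ["iso8859_2", "cp1250", "utf_8"]

# Stand-in for a real corpus: a Polish pangram containing every
# Polish letter that lies outside the English alphabet.
REFERENCE_TEXT = "Zażółć gęślą jaźń"


def non_ascii_letter_freqs(text):
    """Return relative frequencies of the non-ASCII letters in text."""
    letters = [c for c in text if ord(c) > 127 and c.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {c: n / len(letters) for c, n in counts.items()}


def distance(freqs, reference):
    """Sum of absolute differences between two frequency tables."""
    keys = set(freqs) | set(reference)
    return sum(abs(freqs.get(k, 0.0) - reference.get(k, 0.0)) for k in keys)


def guess_encoding(data, reference, candidates=CANDIDATES):
    """Return the candidate whose decoding best matches the reference."""
    best_name, best_score = None, None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = distance(non_ascii_letter_freqs(text), reference)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    reference = non_ascii_letter_freqs(REFERENCE_TEXT)
    data = "żółw, gęś, łąka, źródło".encode("cp1250")
    print(guess_encoding(data, reference))
```

A wrong guess turns some Polish letters into symbols or into letters
that never occur in the reference table, which pushes the distance up;
the candidate with the smallest distance is taken as the answer.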
--
__("< Marcin Kowalczyk * qrczak@xxxxxxxxxx http://qrczak.ids.net.pl/
\__/
^^ SUBSTITUTE SIGNATURE
QRCZAK
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/