
Re: intelligent charset recognition for irc



If you want a simple and easy-to-understand algorithm, just make your
program configurable through a variable that is set to a colon-separated
list of charsets, e.g. utf-8:iso-2022-jp:iso-8859-1, and use iconv to
try converting the string into UTF-8 from each of those charsets in
turn until the conversion succeeds.
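
A minimal sketch of that loop, assuming Python and its built-in codecs
in place of iconv (so the set of accepted charset names is Python's,
not iconv's). The function name decode_first_match and the charsets
parameter are made up for this illustration.

```python
def decode_first_match(raw: bytes, charsets: str):
    """Try each charset in a colon-separated list, in order.

    Returns (charset_name, text) for the first charset whose decoder
    accepts the bytes, or (None, None) if every conversion fails.
    """
    for name in charsets.split(":"):
        try:
            # Strict decoding: any invalid byte sequence raises an error.
            return name, raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this charset.
            # LookupError: Python does not know this charset name.
            continue
    return None, None


if __name__ == "__main__":
    message = "café".encode("iso-8859-1")
    print(decode_first_match(message, "utf-8:iso-8859-1"))
```

The order of the list is the only tuning knob: the function never
looks past the first charset that accepts the bytes.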

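This method should work quite well with some combinations of charsets,
and fail completely with others. In particular, a charset in which
every byte sequence is valid, such as iso-8859-1, has to come last,
because conversion from it always succeeds and nothing listed after it
is ever tried.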

The good thing about it is that it embeds no knowledge of charsets or
language into the program.

And here is an impractical proposal which has the same advantage:

Let X be the last few kilobytes of data that was displayed, in UTF-8.
For each candidate charset Ci convert the new string from Ci into
UTF-8; call the result Yi. Now, for each conversion that succeeded,
concatenate X and Yi and apply gzip to the result. Choose i such that
the length of the gzipped string is minimum.

This works because, assuming people don't interject off-topic
utterances in a different language, the correctly converted string
contains common substrings with the recent data and is compressed
better.
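
The compression idea can be sketched roughly as follows, assuming
Python, with its built-in codecs standing in for the conversion step
and gzip.compress standing in for gzip. The function name
pick_charset_by_compression and its parameters are hypothetical.

```python
import gzip


def pick_charset_by_compression(recent_utf8: bytes, raw: bytes, candidates):
    """Pick the candidate whose conversion compresses best after X.

    recent_utf8 is X, the recently displayed data, already in UTF-8.
    raw is the new string in an unknown charset.
    Returns the chosen charset name, or None if no conversion succeeded.
    """
    best_name = None
    best_size = None
    for name in candidates:
        try:
            # Yi: the new string converted from Ci into UTF-8.
            converted = raw.decode(name).encode("utf-8")
        except (UnicodeError, LookupError):
            continue  # conversion failed, so Ci is not considered
        size = len(gzip.compress(recent_utf8 + converted))
        # Keep the first candidate that reaches the smallest size.
        if best_size is None or size < best_size:
            best_name, best_size = name, size
    return best_name


if __name__ == "__main__":
    context = ("привет всем, как дела сегодня? " * 20).encode("utf-8")
    new_message = "как дела сегодня? привет всем".encode("koi8-r")
    print(pick_charset_by_compression(
        context, new_message, ["iso-8859-5", "koi8-r"]))
```

Ties go to whichever candidate is listed first, so the order of the
candidates still matters a little when the recent data is too short
to tell them apart.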

However, this method could lock onto the wrong charset, so you'd need
a manual override.

Edmund
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/