
semi-automatic character set detection and conversion



Markus Kuhn wrote (Subject: Use of UTF-8 under Perl and Unix):

> To some degree
> you can do autodetection, but not to the degree where it makes manual
> selection unnecessary. UTF-8 can be fairly easily autodetected, by
> checking whether no malformed UTF-8 sequences are present. The various
> ISO 8859-* sets on the other hand cannot be autodetected without
> including something short of a full spell-checker.
> ...
> I have to do this procedure manually frequently:
> 
>   a) make a list of potential candidate encodings
> 
>   b) look for a place in the file where the differences between candidate
>      encodings do matter
> 
>   c) display these places using the various decoding alternatives
> 
>   d) select the one that leads to something that looks like correct spelling
>      to me
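
The UTF-8 check and steps (a) to (c) can be sketched in a few lines of
Python. This is only an illustration under my own assumptions, not the
script at [1]: the function names, the candidate list and the context
width are all made up for the example. Step (d) stays with the user.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data contains no malformed UTF-8 sequences."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False


def show_candidates(data: bytes, candidates, context: int = 20):
    """Steps (b) and (c): locate the first non-ASCII byte and decode the
    bytes around it with each candidate encoding."""
    pos = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if pos is None:
        return {}  # pure ASCII: ASCII-compatible candidates all agree
    snippet = data[max(0, pos - context):pos + context]
    return {enc: snippet.decode(enc, errors="replace") for enc in candidates}


# Step (a): a hypothetical candidate list chosen for this example.
sample = b"100 \xa4"
print(is_valid_utf8(sample))
for enc, text in show_candidates(sample, ["iso-8859-1", "iso-8859-15"]).items():
    print(enc, repr(text))
```

Byte 0xA4 is the generic currency sign in ISO 8859-1 and the euro sign in
ISO 8859-15, so this is one of the places where the choice of candidate
visibly matters.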

Ideally we should have an Emacs mode or some other kind of GUI which would
assist the user in doing these steps while "importing" files into a UTF-8
system.

A plain text / xterm based procedure is at [1]. It's a shell script, which
you can call as
     $ to-utf8 *.c *.txt
and it will convert all of the files to UTF-8, asking the user for
confirmation, using Markus' procedure.
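
For illustration only, the confirm-then-convert step for a single file
might look like the following in Python. This is not the shell script at
[1]; the function name and the `confirm` callback are hypothetical, and
the encoding is assumed to have been picked by the user beforehand.

```python
from pathlib import Path


def convert_to_utf8(path, encoding, confirm=input):
    """Rewrite the file at `path` as UTF-8, but only if the user agrees."""
    raw = Path(path).read_bytes()
    text = raw.decode(encoding)  # raises if the bytes are invalid in `encoding`
    answer = confirm(f"Convert {path} from {encoding} to UTF-8? [y/n] ")
    if answer.strip().lower() != "y":
        return False  # leave the file untouched
    Path(path).write_bytes(text.encode("utf-8"))
    return True
```

Passing `confirm` as a parameter keeps the prompt replaceable, for
instance by a GUI dialog instead of a terminal question.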

User-assisted detection and conversion seems better to me than fully
automatic conversion because:

   - Fully automatic conversions need heuristics, and in the few cases where
     the heuristics fail, the result is a big annoyance to the user.

   - We don't want such heuristics in libc (FILE and iostream) itself.
     Therefore every editor/browser/application would come up with its
     own heuristics, which would only add to the confusion.

                      Bruno

[1] ftp://ftp.ilog.fr/pub/Users/haible/utf8/to-utf8
PS: I didn't make an Emacs mode, because I don't know Emacs programming.
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/