semi-automatic character set detection and conversion
Markus Kuhn wrote (Subject: Use of UTF-8 under Perl and Unix):
> To some degree
> you can do autodetection, but not to the degree where it makes manual
> selection unnecessary. UTF-8 can be fairly easily autodetected, by
> checking whether no malformed UTF-8 sequences are present. The various
> ISO 8859-* sets on the other hand cannot be autodetected without
> including something short of a full spell-checker.
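The UTF-8 check described above can be sketched in a few lines. This is only an illustration, not code from the thread: the function name looks_like_utf8 is made up for this example, and Python's strict decoder stands in for "checking whether no malformed UTF-8 sequences are present". Note that pure ASCII data passes trivially, so a positive answer is only informative when the data contains bytes above 0x7F.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains no malformed UTF-8 sequences.

    Hypothetical helper for illustration. Pure ASCII input also
    returns True, because ASCII is a subset of UTF-8.
    """
    try:
        # The strict decoder raises on any malformed sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

An ISO 8859-1 file with accented letters will almost always fail this check, because a lone byte such as 0xE9 is not a complete UTF-8 sequence.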
> ...
> I have to do this procedure manually frequently:
>
> a) make a list of potential candidate encodings
>
> b) look for a place in the file where the differences between candidate
> encodings do matter
>
> c) display these places using the various decoding alternatives
>
> d) select the one that leads to something that looks like correct spelling
> to me
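Steps a) to c) lend themselves to a small helper; step d) stays with the human. The following is a minimal sketch under stated assumptions: the function names differing_lines and show are invented for this example, the candidate list is whatever the user supplies, and "places where the differences matter" is approximated as "lines on which the candidates do not all decode to the same text".

```python
from typing import Dict, List, Optional, Sequence, Tuple


def differing_lines(
    data: bytes, candidates: Sequence[str], limit: int = 3
) -> List[Tuple[int, Dict[str, Optional[str]]]]:
    """Find up to `limit` lines on which the candidate encodings disagree.

    Returns (line_number, {encoding: decoded_text_or_None}) pairs.
    None marks a candidate that cannot decode the line at all.
    """
    results = []
    for number, raw in enumerate(data.splitlines(), start=1):
        decodings: Dict[str, Optional[str]] = {}
        for name in candidates:
            try:
                decodings[name] = raw.decode(name)
            except UnicodeDecodeError:
                decodings[name] = None
        # Pure ASCII lines decode identically everywhere and are skipped.
        if len(set(decodings.values())) > 1:
            results.append((number, decodings))
            if len(results) >= limit:
                break
    return results


def show(data: bytes, candidates: Sequence[str]) -> None:
    """Print each differing line under every candidate (step c)."""
    for number, decodings in differing_lines(data, candidates):
        for name, text in decodings.items():
            shown = text if text is not None else "<cannot decode>"
            print(f"line {number} as {name}: {shown}")
```

For example, show(b"price: 5 \xa4\n", ["iso-8859-1", "iso-8859-15"]) displays the same line once with a currency sign and once with a euro sign, and the user picks the reading that looks right.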
Ideally we should have an Emacs mode or some other kind of GUI that would
assist the user in doing these steps while "importing" files into a UTF-8
system.
A plain text / xterm based procedure is at [1]. It's a shell script, which
you can call as
$ to-utf8 *.c *.txt
and it will convert all of the files to UTF-8, asking the user for
confirmation, using Markus' procedure.
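To give an idea of the confirm-then-convert step for a single file, here is a hypothetical sketch. It is not the to-utf8 script and does not reproduce its behaviour; the function name convert_to_utf8 and the confirm callback are assumptions made for this example. The encoding argument is the one the user has already chosen.

```python
from pathlib import Path
from typing import Callable


def convert_to_utf8(
    path: Path, encoding: str, confirm: Callable[[str], bool]
) -> bool:
    """Rewrite `path` as UTF-8 if the user confirms the decoded text.

    Returns True if the file was rewritten. Files that are already
    valid UTF-8 (including pure ASCII) are left untouched. Raises
    UnicodeDecodeError if `encoding` cannot decode the file.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # nothing to convert
    except UnicodeDecodeError:
        pass
    text = raw.decode(encoding)
    # The callback is where the user looks at the result and says yes or no.
    if not confirm(text):
        return False
    # Encode and write as bytes so line endings are preserved exactly.
    path.write_bytes(text.encode("utf-8"))
    return True
```

In an interactive setting, confirm could print a few lines of the decoded text and read a y/n answer from the terminal.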
User-assisted detection and conversion seems better than fully automatic
conversions to me because:
- Fully automatic conversions need heuristics, and in the few cases where
the heuristics fail, they cause great annoyance to the user.
- We don't want such heuristics in libc (FILE and iostream) itself.
Therefore every editor/browser/application would come up with its
own heuristics, which would only add to the confusion.
Bruno
[1] ftp://ftp.ilog.fr/pub/Users/haible/utf8/to-utf8
PS: I didn't make an Emacs mode, because I don't know Emacs programming.
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/