
Re: Use of UTF-8 under Perl and Unix



Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn)  wrote on 03.11.99 in <E11invF-0000lV-00@heaton.cl.cam.ac.uk>:

> Bram Moolenaar wrote on 1999-11-03 00:12 UTC:

> > > For instance, if I receive a file that might be either ISO 8859-1 or ISO
> > > 8859-15, then there should be a Perl function that cuts out a few
> > > example words of this file that contain characters where ISO 8859-1 and
> > > ISO 8859-15 differ, such that the user can decide based on a display of
> > > these example characters in various decodings what the most likely
> > > encoding was. If the file does not contain any of the characters in
> > > which say ISO 8859-1 and ISO 8859-15 differ, then the question whether
> > > the file was encoded in ISO 8859-1 or ISO 8859-15 is obviously
> > > irrelevant for converting the encoding of this file.
> >
> > I've never seen it done this way.
>
> Well, I am being paid for thinking about how things might be done more
> conveniently in the future, not for just documenting existing practice.
> :) Such an interactive encoding guessing/preview tool might often come
> in very handy, because it assists me just in what I have to do otherwise
> completely manually.

Actually, I know of a piece of software that does a limited amount of this
automatically, and it has been doing so for a long time now.

It's a German-language BBS system. When a new user's name contains
umlauts, it autodetects the charset that user is sending, distinguishing
among several vendor charsets (DOS, Mac, 8859-1 and so forth). It turned
out that, across the supported charsets, there was exactly one case where
a byte in the range 80-FF was an umlaut in more than one of them.
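
To make the idea concrete, here is a rough sketch in Python. It is not the
BBS's actual code: the candidate list (cp437 for DOS, mac_roman for Mac,
latin_1 for 8859-1) and the function names are my own assumptions for
illustration, and the real system supported more charsets than these.

```python
UMLAUTS = "äöüÄÖÜß"

# Assumed candidate set, for illustration only; the charsets the BBS
# actually supported are not listed in full here.
CANDIDATES = ["cp437", "mac_roman", "latin_1"]


def umlaut_bytes(charset):
    """Map each byte that encodes a German umlaut in `charset` to that letter."""
    return {ch.encode(charset)[0]: ch for ch in UMLAUTS}


def guess_charsets(raw):
    """Return the candidates in which every byte >= 0x80 of `raw` is an umlaut."""
    high = [b for b in raw if b >= 0x80]
    if not high:
        return []  # pure ASCII name: nothing to detect from
    return [cs for cs in CANDIDATES
            if all(b in umlaut_bytes(cs) for b in high)]


if __name__ == "__main__":
    print(guess_charsets("Müller".encode("latin_1")))
    print(guess_charsets("Müller".encode("cp437")))
    print(guess_charsets(b"\x9a"))  # the byte that is ambiguous in this set
```

An empty result or more than one result means the guess is not good enough
and the user has to be asked. With these three candidates the only byte
that is an umlaut in two of them is 0x9A (Ü in cp437, ö in mac_roman).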

If the guesses don't suffice, the software asks, of course; it's also
always possible to do a manual override. (The preferred form of which,
given that users tend not to know what their terminal emulator actually
sends out, is to ask the user to type äöüÄÖÜß and to autodetect on
*that*.)
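
The probe-string override could look roughly like this in Python. Again,
this is only a sketch under assumptions: the candidate list and the name
detect_from_probe are made up for illustration, and utf_8 is included
only because of the topic of this thread.

```python
PROBE = "äöüÄÖÜß"

# Assumed candidate set, for illustration only.
CANDIDATES = ["cp437", "mac_roman", "latin_1", "utf_8"]


def detect_from_probe(received):
    """Return the one candidate whose encoding of PROBE equals `received`.

    Returns None when no candidate, or more than one, matches, in which
    case the user has to pick the charset by hand.
    """
    matches = [cs for cs in CANDIDATES if PROBE.encode(cs) == received]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    # Simulate a Mac terminal sending the probe string.
    print(detect_from_probe(PROBE.encode("mac_roman")))
```

Because the probe contains all seven letters, a single byte that is
ambiguous between two charsets no longer matters; the whole byte sequence
has to match.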

MfG Kai
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/