
Re: Use of UTF-8 under Perl and Unix



Bram Moolenaar wrote on 1999-11-03 00:12 UTC:
> It seems we agree at least on the part of automatic detection not being
> reliable enough.  Which is exactly why it would be so nice if UTF-8 files
> _can_ be detected reliably!  Sorry, I'm repeating myself...

The technique that mined98 uses seems to be fairly reliable. In
practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences
if interpreted as a UTF-8 file. For example, every single non-ASCII
byte that is surrounded by two ASCII bytes is a sure indicator that this
is not a UTF-8 file. UTF-8 files can pretty reliably be recognized by
searching for malformed UTF-8 sequences and not finding any.
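
A minimal sketch of that check, in Python rather than Perl. This is not
mined98's actual code; the function name classify() and its three return
labels are made up for illustration. It leans on Python's strict UTF-8
decoder, which also rejects overlong forms and surrogates, so it is
somewhat stricter than a plain structural scan.

```python
def classify(data: bytes) -> str:
    """Classify a byte string as 'ascii', 'utf-8', or 'not-utf-8'.

    Hypothetical helper: the labels are illustrative only.
    """
    # Pure ASCII is valid in UTF-8 and in every ISO 8859 part,
    # so it carries no information about the encoding.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # The strict decoder raises on the first malformed sequence,
        # e.g. a lone non-ASCII byte between two ASCII bytes.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf-8"
    # Non-ASCII bytes are present and none of them is malformed.
    return "utf-8"
```

For instance, classify(b"na\xefve") reports "not-utf-8", because the
single byte 0xEF sits between ASCII letters, while the same word encoded
as UTF-8 passes.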

The reliable autodetection of UTF-8 is therefore not the problem,
because UTF-8 files have a very characteristic structure and even very
short ISO 8859 and JIS files almost certainly contain byte sequences
that exclude UTF-8 as a potential encoding. The autodetection of other
encodings is much more difficult.

> > For instance, if I receive a file that might be either ISO 8859-1 or ISO
> > 8859-15, then there should be a Perl function that cuts out a few
> > example words of this file that contain characters where ISO 8859-1 and
> > ISO 8859-15 differ, such that the user can decide based on a display of
> > these example characters in various decodings what the most likely
> > encoding was. If the file does not contain any of the characters in
> > which say ISO 8859-1 and ISO 8859-15 differ, then the question whether
> > the file was encoded in ISO 8859-1 or ISO 8859-15 is obviously
> > irrelevant for converting the encoding of this file.
> 
> I've never seen it done this way.

Well, I am being paid for thinking about how things might be done more
conveniently in the future, not for just documenting existing practice.
:) Such an interactive encoding guessing/preview tool might often come
in very handy, because it assists me with exactly what I would otherwise
have to do completely by hand.
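
A rough sketch of what the core of such a preview helper could look
like, again in Python instead of Perl. The names DIFFERING and preview()
and the limit parameter are hypothetical. The set of differing byte
values is computed from Python's own codec tables rather than typed in
by hand.

```python
# Byte values that decode differently under ISO 8859-1 and ISO 8859-15,
# derived from Python's codec tables.
DIFFERING = frozenset(
    b for b in range(256)
    if bytes([b]).decode("iso8859-1") != bytes([b]).decode("iso8859-15")
)

def preview(data: bytes, limit: int = 5):
    """Return up to `limit` example words, each decoded both ways.

    Only words containing a byte on which the two encodings disagree
    are returned. An empty list means the choice between ISO 8859-1
    and ISO 8859-15 is irrelevant for this input.
    """
    samples = []
    # bytes.split() with no argument splits on ASCII whitespace only.
    for word in data.split():
        if any(b in DIFFERING for b in word):
            samples.append(
                (word.decode("iso8859-1"), word.decode("iso8859-15"))
            )
            if len(samples) >= limit:
                break
    return samples
```

A caller would display each pair side by side and let the user pick the
decoding that looks right. Splitting on whitespace is a simplification;
a real tool would probably want smarter word boundaries.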

> Can you imagine Netscape presenting you with a list of encodings for you to
> select one from, each time you open a page that looks like some ISO 8859
> text?

I would obviously object to Netscape presenting me this menu
automatically for each page. However, I would applaud Netscape for
allowing me to activate this menu on demand to help me in finding the
right encoding of an incorrectly MIME-labeled page. Think of it as an
additional forensic/preview tool for the more sophisticated character
detective, not as something that annoys people by popping up additional
windows unrequested.

> No, that will only annoy people.

Whether it is annoying depends on how cleverly the feature is designed,
used, and integrated with the rest of the tool. Just because you can
think of a particularly annoying way to present this feature does not
mean that it has to be implemented that way (or even that this is what I
had in mind).

> Sorry Markus, I don't see this manual selection as a feasible solution.
> Only when it's a user that initiates this, then it would be very useful.
> Thus, in Netscape you would have some "choose encoding" dialog, which shows
> the result of various alternatives.  Yes, that would be nice.  But not as an
> automatic appearing dialog, that is my point.

Excellent. I never suggested anything else and I am glad that you were
able to follow. (It is always nice to observe how someone step-by-step
understands and reconstructs what I was talking about while already
writing the reply ... ;-)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
