
Re: Use of UTF-8 under Perl and Unix




Markus Kuhn wrote:

> Bram Moolenaar wrote on 1999-11-03 00:12 UTC:
> > It seems we agree at least on the part of automatic detection not being
> > reliable enough.  Which is exactly why it would be so nice if UTF-8 files
> > _can_ be detected reliably!  Sorry, I'm repeating myself...
> 
> The technique that mined98 uses seems to be fairly reliable. In
> practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences

98% isn't very reliable.  I would aim for 99.9% at least.

> if interpreted as a UTF-8 file. For example: Every single non-ASCII
> byte that is surrounded by two ASCII bytes is a sure indicator that this
> is not a UTF-8 file. UTF-8 files can pretty reliably be recognized by
> searching for malformed UTF-8 sequences and not finding any.
> 
> The reliable autodetection of UTF-8 is therefore not the problem,
> because UTF-8 files have a very characteristic structure and even very
> short ISO 8859 and JIS files almost certainly contain byte sequences
> that exclude UTF-8 as a potential encoding. The autodetection of other
> encodings is much more difficult.

This kind of detection can still be used for files that don't have a BOM.
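
A minimal sketch of the "search for malformed sequences and find none"
check, assuming Python's strict UTF-8 decoder as the validator. This is
not mined98's code; the function name and the sample strings are made up
for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains no malformed UTF-8 sequences.

    Hypothetical helper: relies on Python's strict decoder, which
    rejects stray lead bytes, stray continuation bytes, and overlong
    forms.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings (sample data, assumed for the example).
latin1_sample = "na\u00efve caf\u00e9".encode("latin-1")
utf8_sample = "na\u00efve caf\u00e9".encode("utf-8")

if __name__ == "__main__":
    print(looks_like_utf8(utf8_sample))
    print(looks_like_utf8(latin1_sample))
```

Note that a pure ASCII file also passes this check, since ASCII is a
subset of UTF-8, and that an ISO 8859 file can pass only if every
non-ASCII run happens to form a valid multi-byte sequence.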

It is clear that adding a BOM will make detecting UTF-8 files more reliable.
You gain that extra 1.9% that avoids a lot of frustration.
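
For comparison, a BOM check is just a test on the first three bytes. A
rough sketch, assuming Python and its `codecs.BOM_UTF8` constant (the
bytes EF BB BF); the helper names are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
    print(has_utf8_bom(sample))
    print(strip_utf8_bom(sample).decode("utf-8"))
```

A file without the BOM would then fall through to the malformed-sequence
check.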

But now the other way around: What is the disadvantage of adding a BOM to the
start of a UTF-8 file?  If there is no real disadvantage, we could just add
the BOM, right?

--
hundred-and-one symptoms of being an internet addict:
72. Somebody at IRC just mentioned a way to obtain full motion video without
    a PC using a wireless protocol called NTSC, you wonder how you never
    heard about it

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/