Re: Use of UTF-8 under Perl and Unix
Markus Kuhn wrote:
> Bram Moolenaar wrote on 1999-11-03 00:12 UTC:
> > It seems we agree at least on the part of automatic detection not being
> > reliable enough. Which is exactly why it would be so nice if UTF-8 files
> > _can_ be detected reliably! Sorry, I'm repeating myself...
>
> The technique that mined98 uses seems to be fairly reliable. In
> practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences
98% isn't very reliable. I would aim for 99.9% at least.
> if interpreted as a UTF-8 file. For example: Every single non-ASCII
> byte that is surrounded by two ASCII bytes is a sure indicator that this
> is not a UTF-8 file. UTF-8 files can pretty reliably be recognized by
> searching for malformed UTF-8 sequences and not finding any.
>
> The reliable autodetection of UTF-8 is therefore not the problem,
> because UTF-8 files have a very characteristic structure and even very
> short ISO 8859 and JIS files almost certainly contain byte sequences
> that exclude UTF-8 as a potential encoding. The autodetection of other
> encodings is much more difficult.
This kind of detection can still be used for files that don't have a BOM.
It is clear that adding a BOM will make detecting UTF-8 files more reliable.
You gain that extra 1.9% that avoids a lot of frustration.
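To make the combination concrete, here is a minimal sketch of checking for a BOM first and falling back to the "search for malformed sequences" test when there is none. It is written in Python purely for illustration; the function name `classify` and its return labels are made up for this example, and this is not how mined98 or Vim actually does it.

```python
import codecs


def classify(data: bytes) -> str:
    """Guess whether a byte string is UTF-8 (hypothetical helper).

    Returns one of: "utf-8-bom", "ascii", "utf-8", "not-utf-8".
    """
    has_bom = data.startswith(codecs.BOM_UTF8)  # the bytes EF BB BF
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        # Strict decoding raises on any malformed UTF-8 sequence,
        # e.g. a lone non-ASCII byte sitting between two ASCII bytes.
        body.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    if has_bom:
        return "utf-8-bom"
    if all(b < 0x80 for b in body):
        return "ascii"  # pure ASCII is valid in both UTF-8 and ISO 8859
    return "utf-8"


if __name__ == "__main__":
    print(classify(b"caf\xe9 au lait"))             # Latin-1 text
    print(classify("caf\u00e9".encode("utf-8")))    # UTF-8 without BOM
    print(classify(codecs.BOM_UTF8 + b"hello"))     # UTF-8 with BOM
    print(classify(b"\xc3\xa9"))                    # ambiguous without BOM
```

The last case is the one where a BOM helps: the two bytes C3 A9 are well-formed UTF-8, but they are also a legal (if unlikely) pair of ISO 8859-1 characters, so without a BOM the answer is only a good guess.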
But now the other way around: What is the disadvantage of adding a BOM to the
start of a UTF-8 file? If there is no real disadvantage, we could just add
the BOM, right?
--
hundred-and-one symptoms of being an internet addict:
72. Somebody at IRC just mentioned a way to obtain full motion video without
a PC using a wireless protocol called NTSC, you wonder how you never
heard about it
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/