Re: Byte-order-marks considered harmful
Bram Moolenaar wrote on 1999-11-03 12:38 UTC:
> But now the other way around: What is the disadvantage of adding a BOM to the
> start of an UTF-8 file? If there is no real disadvantage, we could just add
> the BOM, right?
Just a few examples:
- None of the existing UTF-8 implementations that I know of ignores U+FEFF.
They all treat it like a spacing character for which the glyph is
missing in the font. xterm will show the usual dotted rectangle.
Different terminal emulators will disagree about the cursor position
after a BOM.
- The Unix kernel #!/bin/sh mechanism will break, because the
file will no longer start with #!.
- The BOM will get lost even in trivial file processing operations
such as grep, tail, etc.
- If the BOM gets accidentally added to a PostScript file, it
will no longer start with %!PS-Adobe-2.0 and the PostScript filter
of our printing system will not be activated, so the PostScript
commands themselves get printed.
- Many streaming and piping applications will not know whether
they contribute to the beginning of a file and will therefore
add BOMs in the middle of already existing files, thus breaking
string searches that span the invisible BOMs, etc.
... and zillions of other things like that, especially related to
conventions where applications expect certain bytes at the very
beginning of plaintext files, or where applications remove the first few
lines of a text file.
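To make a few of these failure modes concrete, here is a small Python sketch. The helper names has_magic and write_with_bom are made up for illustration; the first only mimics a tool that looks at the leading bytes of a file, and the second mimics a writer that prepends a BOM to everything it emits. Neither is how any particular kernel or print filter is actually implemented.

```python
BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def has_magic(data: bytes, magic: bytes) -> bool:
    """Hypothetical stand-in for a tool that checks only the first bytes."""
    return data.startswith(magic)


def write_with_bom(text: str) -> bytes:
    """Hypothetical writer that always prepends a BOM to its output."""
    return BOM + text.encode("utf-8")


script = b"#!/bin/sh\necho hello\n"
postscript = b"%!PS-Adobe-2.0\n"

# The magic bytes are found only when nothing precedes them.
shebang_ok = has_magic(script, b"#!")
shebang_with_bom = has_magic(BOM + script, b"#!")
ps_with_bom = has_magic(BOM + postscript, b"%!PS")

# Two BOM-writing programs whose outputs are concatenated, as with cat:
joined = write_with_bom("foo") + write_with_bom("bar")

# The plain "utf-8" codec keeps U+FEFF as an ordinary character, so a
# BOM ends up in the middle of the text and splits the word "foobar".
decoded = joined.decode("utf-8")
found = "foobar" in decoded

print(shebang_ok, shebang_with_bom, ps_with_bom, found)
```

The same effect applies to any byte-oriented search: the three BOM bytes sit between "foo" and "bar", so neither a byte search nor a decoded string search for "foobar" matches.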
Sure, fixes are possible for all of that. However, I doubt that these
fixes will get sufficiently widely deployed before the problem has been
solved anyway by a general migration to UTF-8.
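As an illustration of what one such fix might look like, here is a minimal Python sketch of a reader that drops a leading UTF-8 BOM before looking at the data. The function name strip_leading_bom is made up for this example. Every tool that inspects the start of a file would need something along these lines, and it still does nothing about BOMs that ended up in the middle of a stream.

```python
import codecs


def strip_leading_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop one UTF-8 BOM at the start, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\n"
cleaned = strip_leading_bom(with_bom)

# Python's "utf-8-sig" codec does the same on decoding, but it also
# removes only a BOM at the very start; an embedded one survives.
joined = codecs.BOM_UTF8 + b"foo" + codecs.BOM_UTF8 + b"bar"
decoded = joined.decode("utf-8-sig")
```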
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/