Re: Byte-order-marks considered harmful
Markus Kuhn wrote:
> Bram Moolenaar wrote on 1999-11-03 12:38 UTC:
> > But now the other way around: What is the disadvantage of adding a BOM to
> > the start of an UTF-8 file? If there is no real disadvantage, we could
> > just add the BOM, right?
>
> Just a few examples:
Good, this is the info I was looking for.
> - None of the existing UTF-8 implementations that I know ignores U+FEFF.
> They all treat it like a spacing character for which the glyph is
> missing in the font. xterm will show the usual dotted rectangle.
> Different terminal emulators will disagree about the cursor position
> after a BOM.
Are you saying that these UTF-8 implementations are actually broken, and we
have to fix it in the text files? That doesn't make sense to me.
Terminal emulators would not need to deal with a BOM. They need to have some
termcap/terminfo sequence, or a compile-time or runtime option, to switch to the
right encoding. Using a BOM for that doesn't sound like a good idea.
> - The Unix kernel #!/bin/sh mechanism will break, because the
> file will not start any more with #!
Good point. Putting the BOM in the second line would work. But that's a bit
strange. It would be better to adjust the kernel to handle UTF-8 files, and
thus ignore the BOM in this position. Just one more place that needs to be
UTF-8 aware, not a big deal.
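
A minimal sketch of the point, in Python. It only mimics the idea of a
"first two bytes must be #!" test; the real check lives in the kernel and is
written in C. Both function names are made up for illustration, and the
BOM-aware variant is a hypothetical fix, not something any kernel is known to
do:

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Strict check: the very first two bytes must be '#!'."""
    return data[:2] == b"#!"

def starts_with_shebang_bom_aware(data: bytes) -> bool:
    """Hypothetical relaxed check: skip one leading UTF-8 BOM, then test."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script  # EF BB BF followed by the script

print(starts_with_shebang(script))              # strict check, no BOM
print(starts_with_shebang(with_bom))            # strict check, BOM present
print(starts_with_shebang_bom_aware(with_bom))  # relaxed check, BOM present
```

The strict check fails as soon as the three BOM bytes come first, which is
the breakage described above. The relaxed check is the "one more place that
needs to be UTF-8 aware".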
> - The BOM will get lost even in trivial file processing operations
> such as grep, tail, etc.
In that case we need to fall back to the original autodetection. It does not
mean that the BOM can't be put in the original text file. It is still useful
for Vim, for example.
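
Roughly, "use the BOM when it is there, otherwise fall back to
autodetection" could look like the sketch below. This is not how Vim does it;
the function name and the choice of Latin-1 as the fallback encoding are
assumptions made for the example:

```python
import codecs

def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess a text encoding: trust a UTF-8 BOM, else try strict UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        # Explicit signature: no guessing needed.
        return "utf-8-sig"
    try:
        # Autodetection: the bytes must be well-formed UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback  # assumed legacy encoding
    return "utf-8"

original = codecs.BOM_UTF8 + "é\nü\n".encode("utf-8")
# A byte-oriented "tail" that drops the first line also drops the BOM.
tail_output = b"".join(original.splitlines(keepends=True)[1:])

print(guess_encoding(original))              # BOM present
print(guess_encoding(tail_output))           # BOM lost, autodetection decides
print(guess_encoding("é".encode("latin-1"))) # not valid UTF-8, fallback
```

The autodetection step can guess wrong: legacy 8-bit text that happens to
form valid UTF-8 sequences would be reported as UTF-8, and pure ASCII gives
no information either way.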
> - If the BOM gets accidentally added to a Postscript file, it
> won't start any more with %!PS-Adobe-2.0 and the postscript filter
> of our printing system will not become activated, resulting in the
> postscript commands being printed.
Did you try this? Printers always have a hard time figuring out where the
start of a file is anyway (I have worked on a job control language and ran
into this problem). Also, would a PostScript file contain UTF-8? I don't
recall reading anything about the PostScript standard supporting UTF-8.
And if it does support UTF-8, interpreters can ignore the BOM.
> - many streaming and piping applications will not know whether
> they contribute to the beginning of a file and will therefore
> add BOMs in the middle of already existing files, thus tampering with
> string search across the invisible BOMs, etc.
Why don't they know they are not at the start of the file? I can't think of a
real life example where a UTF-8 aware program would do this wrong. In case of
doubt, just don't insert a BOM.
The problem I do see is that when doing "cat file1 file2 >file3" you get a BOM
in the middle of the file (assuming cat doesn't know about UTF-8). If the BOM
is seen as a non-printing zero-width character it's mostly OK, but the
string searching problem still applies.
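
A small Python sketch of the cat case, under the assumption that both input
files start with a UTF-8 BOM and that cat just copies bytes. The helper name
is made up for the example:

```python
import codecs

BOM = codecs.BOM_UTF8

file1 = BOM + "foo".encode("utf-8")
file2 = BOM + "bar".encode("utf-8")
file3 = file1 + file2  # what a byte-oriented "cat file1 file2 >file3" writes

# The "utf-8-sig" codec strips only a BOM at the very start of the data.
text = file3.decode("utf-8-sig")

def strip_embedded_boms(s: str) -> str:
    """Remove every U+FEFF from the text (hypothetical cleanup step)."""
    return s.replace("\ufeff", "")

print(repr(text))
print("foobar" in text)                       # search across the hidden BOM
print("foobar" in strip_embedded_boms(text))  # search after cleanup
```

The second BOM survives decoding as U+FEFF between "foo" and "bar", so a
plain substring search for "foobar" misses it. Stripping every U+FEFF
repairs the search, but it would also delete any U+FEFF that was put there on
purpose as a zero-width no-break space.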
> ... and zillions other things like that, especially related to
> conventions where applications expect certain bytes at the very
> beginning of plaintext files, or where applications remove the first few
> lines of a text file.
When cutting, copying and pasting text, while being unaware of UTF-8, you can
lose the BOM. But that just means we go back to the autodetection, thus it
isn't a real disadvantage. UTF-8 aware programs will _not_ have this problem.
The problem I do see is when the BOM is in a place where it is not expected.
But how often does this happen?
> Sure, there are all fixes possible for that. However, I doubt that these
> fixes will get sufficiently widely deployed before the problem has been
> solved anyway by a general migration to UTF-8.
Back to the migration to UTF-8 again. Well, whether and how fast UTF-8 gets
accepted is anybody's guess. Perhaps some better standard pops up and takes
over. Perhaps UTF-8 just gets accepted by 40% of the people. Anyway, I think
we should prepare for a mixed environment, at least for the coming ten years.
The above does show a few disadvantages to using a BOM. I'm not sure how much
of this can be solved or will be a non-problem in daily use. There are also
disadvantages to _not_ using a BOM, since autodetection can fail. How often
the autodetection fails is also a guess.
Subtract a guessed number from another guessed number, what do you get?
--
hundred-and-one symptoms of being an internet addict:
82. AT&T names you Customer of the Month for the third consecutive time.
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/