Re: Byte-order-marks considered harmful
Bram@moolenaar.net (Bram Moolenaar) wrote on 03.11.99 in <199911032207.XAA11389@moolenaar.net>:
> Markus Kuhn wrote:
>
> > Bram Moolenaar wrote on 1999-11-03 12:38 UTC:
> > - None of the existing UTF-8 implementations that I know ignores U+FEFF.
> > They all treat it like a spacing character for which the glyph is
> > missing in the font. xterm will show the usual dotted rectangle.
> > Different terminal emulators will disagree about the cursor position
> > after a BOM.
>
> Are you saying that these UTF-8 implementations are actually broken, and we
> have to fix it in the text files? That doesn't make sense to me.
Nor to me. And it sure looks broken.
> Terminal emulators would not need to deal with a BOM.
Uh, it's a perfectly legal char. The full listing is:
FEFF    ZERO WIDTH NO-BREAK SPACE
        = BYTE ORDER MARK
        = BOM
        * may be used to detect byte order by contrast with FFFE which is not
          a character
        * may also be used as zero width no-break space
        -> FFFE <not a character>
In the second role, it's important in some languages to get correct
layout. (Which suggests that just ignoring it is often NOT right.)
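To illustrate the first role in that listing, here is a minimal Python
sketch of byte-order detection in a 16-bit stream. The helper name
detect_utf16_byte_order is made up for this example; only the codecs
constants come from the standard library.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its first two bytes.

    Hypothetical helper for illustration only.
    """
    if data[:2] == codecs.BOM_UTF16_BE:   # FE FF: U+FEFF stored big-endian
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # FF FE: would read as FFFE (not a character) if taken big-endian
        return "little-endian"
    return "unknown"

text = "\ufeffhello"
print(detect_utf16_byte_order(text.encode("utf-16-be")))  # big-endian
print(detect_utf16_byte_order(text.encode("utf-16-le")))  # little-endian

# In UTF-8 the same character is always the three bytes EF BB BF,
# so there is no byte order left to detect.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)        # True
```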
> They need to have
> some termcap/terminfo sequence or compile-time/runtime option to switch it
> to the right encoding. Using a BOM for that doesn't sound like a good idea.
Uh, of course not. If there's anything to switch at all, this would be a
perfect place for ESC % G - and indeed, that's how the Linux console does
it.
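As a rough sketch of what that switch looks like from a program: the
function name below is invented for this example, and the ESC % @
sequence for switching back is taken from the Linux console
documentation as I remember it, so please check it against
console_codes(4) before relying on it.

```python
ESC = "\x1b"

def console_switch_sequence(utf8: bool) -> str:
    """Return the escape sequence that selects the console character set.

    Hypothetical helper: ESC % G selects UTF-8; ESC % @ is assumed to
    select the default (ISO 8859-1) mode again.
    """
    return ESC + ("%G" if utf8 else "%@")

# Usage on a terminal that understands these sequences:
#   import sys
#   sys.stdout.write(console_switch_sequence(True))
#   sys.stdout.flush()
```

The point is that the switch is out-of-band terminal control, not
something stored inside the text file itself.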
> > - The Unix kernel #!/bin/sh mechanism will break, because the
> > file will not start any more with #!
>
> Good point. Putting the BOM in the second line would work. But that's a
> bit strange. It would be better to adjust the kernel to handle UTF-8 files,
> and thus ignore the BOM in this position. Just one more place that needs to
> be UTF-8 aware, not a big deal.
I suspect this is indeed a big deal; I do not expect kernel developers to
be willing to change this.
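A simplified model of the problem, in Python: the function below only
mimics the "first two bytes must be #!" test and is not the kernel's
actual code.

```python
import codecs

def looks_like_script(first_bytes: bytes) -> bool:
    """Simplified stand-in for the #! check: the file must begin with '#!'."""
    return first_bytes[:2] == b"#!"

plain = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + plain     # EF BB BF now precedes '#!'

print(looks_like_script(plain))        # True
print(looks_like_script(with_bom))     # False
```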
Besides, it's not only the kernel. For example, a BOM at the start of an
RFC 822 mail is definitely illegal.
It's typically a problem in text file formats defined for ASCII that have
non-trivial requirements. There are a lot of these around.
For another example, think of
for i in $( cat textfile ) ; ...
Personally, I seriously doubt this will fly.
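To see what goes wrong in that loop, here is a small Python sketch that
imitates byte-oriented word splitting on whitespace. It is only an
approximation of what the shell does (no IFS handling, no globbing), and
the sample file content is invented.

```python
import codecs

# Contents of a hypothetical "textfile" saved with a leading BOM.
textfile = codecs.BOM_UTF8 + b"alpha beta gamma\n"

# bytes.split() with no argument splits on ASCII whitespace only,
# so the three BOM bytes stay glued to the first word.
words = textfile.split()

print(words[0])             # b'\xef\xbb\xbfalpha'
print(words[0] == b"alpha") # False
print(words[1:])            # [b'beta', b'gamma']
```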
> The problem I do see is that when doing "cat file1 file2 >file3" you get a
> BOM in the middle of the file (assuming cat doesn't know about UTF-8). If
> the BOM is seen as a non-printing zero-width character it's mostly OK, but
> the string searching problem still applies.
Uh, no, it's not mostly OK. Unless you define "mostly" geographically.
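The quoted cat scenario can be reproduced with plain byte strings. This
is a sketch with invented file contents; byte_cat is a made-up stand-in
for a cat that knows nothing about UTF-8.

```python
import codecs

def byte_cat(*files: bytes) -> bytes:
    """Concatenate file contents byte for byte, like an encoding-unaware cat."""
    return b"".join(files)

file1 = codecs.BOM_UTF8 + b"foo"
file2 = codecs.BOM_UTF8 + b"bar\n"
file3 = byte_cat(file1, file2)

print(file3.count(codecs.BOM_UTF8))  # 2: one BOM is now in the middle
print(b"foobar" in file3)            # False: the search misses the joined text
```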
> Back to the migration to UTF-8 again. Well, if and how fast UTF-8 is
> accepted is a guess for everybody. Perhaps some better standard pops up and
> takes over. Perhaps UTF-8 just gets accepted by 40% of the people. Anyway,
> I think we should prepare for a mixed environment, at least for the coming
> ten years.
Let's be reasonable. The whole *point* of using UTF-8 is to remain ASCII-
compatible. Putting in BOMs is not ASCII-compatible.
Let's just drop that particular idea, please.
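For what it's worth, the compatibility argument is easy to check in
Python. The sample text and the is_pure_ascii helper are invented for
this sketch; "utf-8-sig" is Python's name for UTF-8 with a leading BOM.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)

text = "plain ASCII text\n"

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")   # prepends EF BB BF

print(without_bom == text.encode("ascii"))  # True: byte-identical to ASCII
print(is_pure_ascii(without_bom))           # True
print(is_pure_ascii(with_bom))              # False: the BOM bytes are not ASCII
```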
MfG Kai
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/