
Re: Byte-order-marks considered harmful



Bram@moolenaar.net (Bram Moolenaar)  wrote on 03.11.99 in <199911032207.XAA11389@moolenaar.net>:

> Markus Kuhn wrote:
>
> > Bram Moolenaar wrote on 1999-11-03 12:38 UTC:

> >  - None of the existing UTF-8 implementations that I know ignores U+FEFF.
> >    They all treat it like a spacing character for which the glyph is
> >    missing in the font. xterm will show the usual dotted rectangle.
> >    Different terminal emulators will disagree about the cursor position
> >    after a BOM.
>
> Are you saying that these UTF-8 implementations are actually broken, and we
> have to fix it in the text files?  That doesn't make sense to me.

Nor to me. And it sure looks broken.

> Terminal emulators would not need to deal with a BOM.

Uh, it's a perfectly legal char. The full listing is:

FEFF ZERO WIDTH NO-BREAK SPACE
     = BYTE ORDER MARK
     = BOM
     * may be used to detect byte order by contrast with FFFE which is not
       a character
     * may also be used as zero width no-break space
     -> FFFE <not a character>

In the second role, it's important in some languages to get correct  
layout. (Which suggests that just ignoring it is often NOT right.)
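
To make the listing concrete, here is a minimal sketch in Python (my choice of
language for illustration; nothing in this thread depends on it). It looks the
character up in the Unicode database that ships with the interpreter and shows
that a plain UTF-8 decode keeps it as an ordinary character. The variable
names are made up for this example.

```python
import unicodedata

BOM = "\ufeff"

# Name and general category as recorded in the Unicode character database.
name = unicodedata.name(BOM)
category = unicodedata.category(BOM)  # "Cf" means "format character"

# UTF-8 is a byte-oriented encoding, so the character always becomes the
# same three bytes; there is no byte order left for it to signal.
utf8_bytes = BOM.encode("utf-8")

# A plain UTF-8 decode does not drop the character: it stays in the text.
kept = (utf8_bytes + b"abc").decode("utf-8")

print(name, category, utf8_bytes.hex(), repr(kept))
```

So a decoder that silently dropped U+FEFF would also drop it in its second
role, which is the case I am worried about.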

> They need to have
> some termcap/terminfo sequence or compile-time/run-time option to switch it
> to the right encoding.  Using a BOM for that doesn't sound like a good idea.

Uh, of course not. If there's anything to switch at all, this would be a  
perfect place for ESC % G - and indeed, that's how the Linux console does  
it.
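
As a rough sketch of what that looks like from a program, assuming Python and
a terminal that honours these sequences: ESC % G is the one named above, while
ESC % @ for switching back is from my memory of the Linux console
documentation and should be treated as an assumption. The helper name
utf8_switch_sequence is hypothetical.

```python
import sys

ESC = b"\x1b"

def utf8_switch_sequence(enable: bool) -> bytes:
    """Return the escape sequence that selects UTF-8 or the default charset."""
    # ESC % G selects UTF-8; ESC % @ (assumed) returns to the default.
    return ESC + (b"%G" if enable else b"%@")

if __name__ == "__main__":
    # Write the raw bytes to the terminal; nothing is stored in any file.
    sys.stdout.buffer.write(utf8_switch_sequence(True))
    sys.stdout.buffer.flush()
```

The point is that the switch is sent to the terminal out of band, so the text
file itself stays untouched.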

> >  - The Unix kernel #!/bin/sh mechanism will break, because the
> >    file will not start any more with #!
>
> Good point.  Putting the BOM in the second line would work.  But that's a
> bit strange.  It would be better to adjust the kernel to handle UTF-8 files,
> and thus ignore the BOM in this position.  Just one more place that needs to
> be UTF-8 aware, not a big deal.

I suspect this is indeed a big deal; I do not expect kernel developers to  
be willing to change this.
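
To illustrate why the #! mechanism breaks, here is a simplified stand-in in
Python for the check on the first two bytes of the file. The real kernel code
is of course not Python and does more than this; looks_like_script is a
hypothetical name for this sketch only.

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

def looks_like_script(first_bytes: bytes) -> bool:
    """Simplified model: a script must begin with the two bytes '#!'."""
    return first_bytes[:2] == b"#!"

plain = b"#!/bin/sh\necho hello\n"
with_bom = BOM_UTF8 + plain

print(looks_like_script(plain))     # the file starts with '#!'
print(looks_like_script(with_bom))  # the file starts with EF BB instead
```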

Besides, it's not only the kernel. For example, a BOM at the start of an
RFC 822 mail is definitely illegal.

It's typically a problem in text file formats defined for ASCII that have  
non-trivial requirements. There are a lot of these around.

For another example, think of

        for i in $( cat textfile ) ; ...
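
A small Python model of what happens there, assuming the shell splits the
output on ASCII whitespace only (the default IFS of space, tab and newline)
and knows nothing about UTF-8. The function name shell_like_words is invented
for this sketch.

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

def shell_like_words(data: bytes) -> list:
    """Split on ASCII whitespace, roughly as word splitting would."""
    return data.split()

textfile = BOM_UTF8 + b"alpha beta\ngamma\n"
words = shell_like_words(textfile)

# The three BOM bytes are not whitespace, so they stay glued to the
# first word and get passed along as part of it.
print(words)
```

The first value of i is then no longer "alpha" but the three BOM bytes followed
by "alpha", and whatever the loop does with it will quietly go wrong.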

Personally, I seriously doubt this will fly.

> The problem I do see is that when doing "cat file1 file2 >file3" you get a
> BOM in the middle of the file (assuming cat doesn't know about UTF-8).  If
> the BOM is seen as a non-printing zero-width character it's mostly OK, but
> the string searching problem still applies.

Uh, no, it's not mostly OK. Unless you define "mostly" geographically.
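
A sketch of the cat case in Python, with byte strings standing in for the
three files. It assumes a cat that simply copies bytes. The "utf-8-sig" codec
used at the end is Python's BOM-stripping variant and is not something from
this thread.

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

file1 = BOM_UTF8 + b"foo"
file2 = BOM_UTF8 + b"bar\n"

# What a byte-oriented "cat file1 file2 >file3" produces.
file3 = file1 + file2

# A search across the join fails, because a BOM sits between the halves.
found_in_bytes = b"foobar" in file3

# Decoding does not help: the second BOM is still a character in the text.
text = file3.decode("utf-8")
bom_count = text.count("\ufeff")
found_in_text = "foobar" in text

# Even a decoder that strips a leading BOM leaves the embedded one alone.
stripped = file3.decode("utf-8-sig")

print(found_in_bytes, bom_count, found_in_text, repr(stripped))
```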

> Back to the migration to UTF-8 again.  Well, if and how fast UTF-8 is
> accepted is a guess for everybody.  Perhaps some better standard pops up and
> takes over.  Perhaps UTF-8 just gets accepted by 40% of the people.  Anyway,
> I think we should prepare for a mixed environment, at least for the coming
> ten years.

Let's be reasonable. The whole *point* of using UTF-8 is to remain
ASCII-compatible. Putting in BOMs is not ASCII-compatible.

Let's just drop that particular idea, please.

MfG Kai
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/