
Re: Byte-order-marks considered harmful




[Sorry for the delay, I have been on holidays]

Kai Henningsen wrote:

> > Terminal emulators would not need to deal with a BOM.
> 
> Uh, it's a perfectly legal char. The full listing is:

I meant: terminal emulators don't have to display it, which comes very close
to ignoring the character.  What happens with copy and paste is a difficult
issue, though.  Do you copy what you see displayed or what has been sent to
the emulator?  You don't see the ZWNBSP (zero-width no-break space, U+FEFF,
the character used as the BOM).  You also don't see cursor positioning
commands, and you certainly don't want to copy those.  Just copying what is
visible will often be what the user expects (and that might be difficult
enough).

> > >  - The Unix kernel #!/bin/sh mechanism will break, because the
> > >    file will not start any more with #!
> >
> > Good point.  Putting the BOM in the second line would work.  But that's a
> > bit strange.  It would be better to adjust the kernel to handle UTF-8 files,
> > and thus ignore the BOM in this position.  Just one more place that needs to
> > be UTF-8 aware, not a big deal.
> 
> I suspect this is indeed a big deal; I do not expect kernel developers to  
> be willing to change this.

Hmm, they will have to deal with UTF-8 file names anyway.
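
To make the #! problem concrete, here is a minimal sketch in Python.  It is
not kernel code: the function names are hypothetical, and the "tolerant"
variant only illustrates what ignoring a BOM in this position would mean.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def has_shebang(head: bytes) -> bool:
    """Strict check: the file must begin with the two bytes '#!'."""
    return head.startswith(b"#!")


def has_shebang_bom_tolerant(head: bytes) -> bool:
    """Hypothetical relaxed check: skip one leading UTF-8 BOM first."""
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    return head.startswith(b"#!")


script = UTF8_BOM + b"#!/bin/sh\necho hello\n"
print(has_shebang(script))               # False
print(has_shebang_bom_tolerant(script))  # True
```

The strict check fails because the first three bytes of the file are now
EF BB BF rather than "#!".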

> Besides, it's not only the kernel. For example, a BOM at the start of a  
> RFC 822 mail is definitely illegal.
> 
> It's typically a problem in text file formats defined for ASCII that have  
> non-trivial requirements. There are a lot of these around.
> 
> For another example, think of
> 
>         for i in $( cat textfile ) ; ...
> 
> Personally, I seriously doubt this will fly.

There are some problems with putting a BOM at the start of the file.  I'm
starting to think that putting the BOM at the end of the file would remove
some of these problems (but not all, and it would make things more
complicated when reading text from a stream).

On the other hand, the problem of having to manually select the encoding of
each file might still be a bigger problem.  Users like to do "vim file" and
get it displayed correctly, without having to worry about setting the correct
encoding.
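
As a rough sketch of what automatic selection could look like when a file is
opened, assuming a leading BOM is the only UTF-8 marker: the function name
and the "latin-1" fallback are assumptions chosen for the example, not a
description of what Vim does.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Pick an encoding without asking the user (illustrative only)."""
    if data.startswith(UTF8_BOM):
        return "utf-8"    # explicit marker present
    if all(byte < 0x80 for byte in data):
        return "ascii"    # pure ASCII reads the same in either encoding
    return fallback       # no marker: fall back to the configured default


print(guess_encoding(UTF8_BOM + "na\u00efve".encode("utf-8")))  # utf-8
print(guess_encoding(b"plain text"))                            # ascii
print(guess_encoding("na\u00efve".encode("latin-1")))           # latin-1
```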

> Let's be reasonable. The whole *point* of using UTF-8 is to remain ASCII- 
> compatible. Putting in BOMs is not ASCII-compatible.

I don't understand this remark.  If the file only contains ASCII characters
(code points below 128) there should be no BOM: the UTF-8 file is
byte-for-byte equal to the ASCII file, so it's not really UTF-8 encoded.
When there are non-ASCII characters, the file is not ASCII compatible anyway
and the BOM can be used.
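
The rule can be sketched in a few lines of Python.  The function name is
hypothetical, and this only illustrates the rule as stated above (it needs
Python 3.7 or later for str.isascii).

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def encode_with_optional_bom(text: str) -> bytes:
    """Add a BOM only when the text contains non-ASCII characters."""
    if text.isascii():
        # Identical to the plain ASCII file: no marker needed.
        return text.encode("ascii")
    # Non-ASCII content: mark the file as UTF-8.
    return UTF8_BOM + text.encode("utf-8")


print(encode_with_optional_bom("hello"))   # b'hello'
print(encode_with_optional_bom("\u00e9"))  # b'\xef\xbb\xbf\xc3\xa9'
```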

> Let's just drop that particular idea, please.

I'm still not satisfied with the solution of letting the user select the
encoding manually.  This is important enough for me to keep on searching for
the best solution.  Knowing what problems each solution has is very helpful.

--
~
~
~
".signature" 4 lines, 50 characters written

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/