
Re: Byte-order-marks considered harmful



Kai Henningsen wrote on 1999-11-07 22:09 UTC:
> Bram@moolenaar.net (Bram Moolenaar)  wrote on 03.11.99 in <199911032207.XAA11389@moolenaar.net>:
> > Terminal emulaters would not need to deal with a BOM.
> 
> Uh, it's a perfectly legal char. The full listing is:
> 
> FEFF ZERO WIDTH NO-BREAK SPACE
>      = BYTE ORDER MARK
>      = BOM
>      * may be used to detect byte order by contrast with FFFE which is not
>        a character
>      * may also be used as zero width no-break space
>      -> FFFE <not a character>
> 
> In the second role, it's important in some languages to get correct  
> layout. (Which suggests that just ignoring it is often NOT right.)

The ZWNBSP is not used to control layout in languages with
ligatures (Indic, most notably). There are other Unicode control codes
available for this, namely:

  U+200C  ZERO WIDTH NON-JOINER
  U+200D  ZERO WIDTH JOINER

The characters U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER
are part of collections 16-24 (DEVANAGARI, BENGALI, GURMUKHI, GUJARATI,
ORIYA, TAMIL, TELUGU, KANNADA, MALAYALAM) in ISO 10646-1:1993 and are
needed to encode these scripts, which xterm isn't handling anyway. It
has also been suggested by TeX enthusiasts that U+200C ZERO WIDTH
NON-JOINER could theoretically be used for languages such as German,
where ligatures must be suppressed across subword boundaries (as in TeX,
where you have to write "Auf{}lage" to suppress the fl ligature), but
this obviously is hardly practical, and the proper and commonly used
solution is to deactivate ligatures in fonts completely for German text.
Unlike for English, fi/fl/etc. ligatures are not commonly used in German
fine typography, so the ZERO WIDTH NON-JOINER idea is just a hack for
German users of software such as TeX that was primarily intended to
typeset English.
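
To make the "Auf{}lage" idea concrete, here is a minimal Python sketch of
the same word carrying a U+200C between "f" and "l". The variable names
are illustrative only, and whether a renderer actually honours the
non-joiner depends on the font and layout engine, which this sketch does
not model.

```python
# "Auflage" with a ZERO WIDTH NON-JOINER at the subword boundary Auf|lage.
ZWNJ = "\u200c"
word = "Auf" + ZWNJ + "lage"

# The non-joiner is a real code point in the string, so it counts
# towards the length even though it has no visible width.
code_points = [f"U+{ord(c):04X}" for c in word]
print(len(word), code_points)

# Removing it gives back the plain spelling.
plain = word.replace(ZWNJ, "")
print(plain)
```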

As long as we do not support the Indic scripts (which would require
major extensions in the X11 font conventions and would upset much of the
simple terminal emulator semantics), we also do not have to worry about
free-standing zero-width characters. I suggest treating these exactly
like nonexistent Unicode characters: preserve them as spacing
characters for cut&paste and represent them in xterm with the default
character. That is already what happens now.
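
As a rough illustration of that policy, here is a minimal Python sketch
of one row of a fixed-width screen. It is not xterm's actual code: the
function names, the ASCII-only font and the "?" stand-in for the default
character are all assumptions made for the example.

```python
# Hypothetical model of one fixed-width terminal row; not xterm's code.
FONT_GLYPHS = set(map(chr, range(0x20, 0x7F)))  # assumed: printable ASCII only
DEFAULT_GLYPH = "?"  # stand-in for the font's default character


def put_text(text):
    """Store one code point per cell, with no special cases."""
    return list(text)


def render(cells):
    """Draw each cell, falling back to the default glyph if none exists."""
    return "".join(c if c in FONT_GLYPHS else DEFAULT_GLYPH for c in cells)


def copy_selection(cells, start, end):
    """Cut&paste returns the stored code points, not the drawn glyphs."""
    return "".join(cells[start:end])


row = put_text("a\ufeffb\u200cc")
shown = render(row)                        # zero-width chars take a cell each
copied = copy_selection(row, 0, len(row))  # original code points survive
print(shown)
```

The point of the design is that the zero-width characters need no extra
logic: they occupy a cell like any other character without a glyph, and
a selection hands them back unchanged.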

The ZWNBSP was really primarily introduced for the BOM hack in UCS-2,
not because it has some essential other function in encoding a language.
I am not sure exactly what the U+200B ZERO WIDTH SPACE is good for, but
suspect that it might have potential uses in special applications such
as word processor math formula and table formatting.

I do not want to treat the ZWNBSP as a special character in terminal
emulators. I want it to get displayed and be cut&pasteable, just like
any other character for which there is no glyph in the font. If we just
silently drop ZWNBSPs in the terminal emulator (like we do with a NUL or
DEL control character), then a cut&paste will not pick up this character
again. Preserving a zero-width character, which - unlike a combining
character! - is not associated with any other glyph, for cut&paste
operations is IMHO a dubious thing. It needs nasty extra logic that has
to be replicated in every application and it is unclear what to do with
a ZWNBSP at the start or end of a cut region. It just adds complexity to
the game for no good reason.
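
The loss on cut&paste can be sketched in a few lines of Python. This is
a hypothetical model, not any emulator's real code: the DROPPED set and
the function name are assumptions, and adding U+FEFF to the set stands
for the "silently drop" policy argued against here.

```python
# Hypothetical "silently drop" policy: NUL and DEL, plus ZWNBSP.
DROPPED = {"\x00", "\x7f", "\ufeff"}


def put_text_dropping(text):
    """Store text in screen cells, discarding the dropped characters."""
    return [c for c in text if c not in DROPPED]


original = "foo\ufeffbar"
cells = put_text_dropping(original)

# A later selection can only return what was stored in the cells,
# so the ZWNBSP never reaches the paste buffer.
pasted = "".join(cells)
print(pasted == original)
```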

In a fixed-width terminal emulator, you can't easily treat the ZWNBSP
as a full character and keep it invisible at the same time. It just
falls outside the concept of a fixed-width terminal; therefore I'd
very much prefer to ignore its existence (just like all the other
Unicode control characters in the U+20XX range) and treat it exactly
like an unknown Unicode character. I very much believe (and hope) that
people don't really want to use these things in Unix plain text files.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/