
Re: Character set tagging considered harmful



Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn)  wrote on 18.09.99 in <E11SNbm-0004EV-00@heaton.cl.cam.ac.uk>:

> Bram Moolenaar wrote on 1999-09-18 12:23 UTC:
> > I wonder, is UCS-4 the maximum that is in use today?
>
> More than that.

You mean less.

> UTF-16 is an extension of UCS-2 that uses a pair of 16-bit characters
> from a high and low surrogate area in UCS-2 to represent characters in
> planes 1 to 16 (U+010000 to U+10FFFF). UTF-16 can cover a bit over 1

And if you design software and have any choice at all, avoid UTF-16 like  
the plague. Because that's what it is.

It's an ugly hack to make people think they can get away with UCS-2  
because they'll "just" implement UTF-16 when they need those extra chars.

Don't do that.

Use UTF-8 and UCS-4 exclusively.

What's the point of avoiding multi-byte characters, and thus getting  
incompatible with ASCII, when you *still* have to handle multi-word  
characters?
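To make the "multi-word" point concrete, here is a minimal Python sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units. The function name `utf16_units` is made up for illustration, and the sketch does not reject code points that are themselves surrogates.

```python
def utf16_units(cp):
    """Return the list of 16-bit code units UTF-16 uses for code point cp."""
    if cp < 0x10000:
        return [cp]                  # plane 0: one unit, same as UCS-2
    if cp > 0x10FFFF:
        raise ValueError("outside the UTF-16 range")
    v = cp - 0x10000                 # 20 bits left to distribute
    high = 0xD800 | (v >> 10)        # high surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)       # low surrogate carries the bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_units(0x41)])      # a single unit
print([hex(u) for u in utf16_units(0x10000)])   # a surrogate pair
```

Any code that indexes or counts such text by 16-bit unit has to cope with the two-unit case, which is the same kind of work a UTF-8 consumer does for multi-byte sequences.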

At least UTF-7 solves a real problem (in an ugly way), that of keeping  
both (most of) ASCII and staying within 7 bits.

> million characters. It has been agreed between the Unicode consortium
> and ISO that they will never standardize a character with a code >
> U+10FFFF. So UTF-16 will be able to encode everything that will come in
> the future. A code range of 1 million is commonly considered to be more
> than good enough. Plenty of room for contact with
> extraterrestrials ... ;-)

Well, if we use planes 0, 1, 2, and 14, then within this space, and
assuming ETs need about as many characters as we do, we have room for
about three types of ETs. Not all that many.

If, OTOH, we're using 16 planes but are willing to put ETs outside the
UTF-16 range, we can handle about 2000 types.
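One way to reproduce those two estimates, as a rough Python sketch. It assumes 65536 code points per plane, 17 planes (0 to 16) in the UTF-16 range, and a 31-bit UCS-4 space; the function names and the "one block of planes per species" model are only an illustrative reading of the arithmetic.

```python
PLANE = 0x10000                   # code points per plane
UTF16_PLANES = 17                 # planes 0 to 16, U+0000 to U+10FFFF
UCS4_PLANES = 2**31 // PLANE      # 31-bit space: 32768 planes


def et_types_in_utf16(planes_per_species=4):
    # We take one share of planes; what is left of the 17 goes to ETs.
    return (UTF16_PLANES - planes_per_species) // planes_per_species


def et_types_in_ucs4(planes_per_species=16):
    # We keep the UTF-16 range; ETs get whole blocks of planes beyond it.
    return (UCS4_PLANES - UTF16_PLANES) // planes_per_species


print(et_types_in_utf16())   # 3
print(et_types_in_ucs4())    # 2046
```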

> > I need to reserve space for each character, thus I
> > would like to know if 4 bytes is enough.

> > The UTF-8 encoding might be longer, of course.
>
> No. Better have another careful look at how UTF-8 really works:

Yes.

> UTF-8 has no way of encoding characters more than 31-bit long.
> A 32-bit integer will be able to hold the value of any legal
> UTF-8 sequence.

But the UTF-8 sequence in question can be 6 bytes long, IIRC.

Of course, the UTF-16 range can be covered with a maximum of 4 bytes.
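The byte counts follow from the ranges each sequence length covers. A small Python sketch, assuming the original 31-bit definition of UTF-8 with sequences of up to 6 bytes; `utf8_len` is a hypothetical helper name, not a library function.

```python
# Exclusive upper limits for 1-, 2-, ..., 6-byte sequences
# (payload of 7, 11, 16, 21, 26 and 31 bits respectively).
LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def utf8_len(cp):
    """Bytes needed to encode code point cp in 31-bit UTF-8."""
    if cp < 0:
        raise ValueError("negative code point")
    for nbytes, limit in enumerate(LIMITS, start=1):
        if cp < limit:
            return nbytes
    raise ValueError("more than 31 bits")


print(utf8_len(0x10FFFF))      # 4: top of the UTF-16 range
print(utf8_len(0x7FFFFFFF))    # 6: top of the 31-bit range
```

So a buffer sized at 4 bytes per character is enough for the decoded value in every case, but enough for the encoded form only if input is limited to the UTF-16 range.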

> > Are you saying that it's not possible to detect UTF-8 encoding reliably?
> > Well, that's something that needs to be worked on!

Detecting stuff reliably is unreliable *in principle*. It's the computer  
equivalent of the Heisenberg thing. You can't *reliably* detect a GIF,  
either.

In the case of UTF-8, a pure-ASCII file is perfect UTF-8. Think this over.

What you *can* do is detect that a text *cannot* be UTF-8, because it  
contains illegal sequences. It's actually a pretty good detector; it's  
fairly unlikely that anything else matches UTF-8 rules except by being  
strictly 7-bit.
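A minimal sketch of such a negative detector in Python. `cannot_be_utf8` is an illustrative name. Note that Python's built-in decoder follows the later, stricter UTF-8 definition (at most 4 bytes, no overlong forms, no surrogates), so it rejects somewhat more than the 1999 rules would.

```python
def cannot_be_utf8(data: bytes) -> bool:
    """True if data contains a sequence that is illegal in UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


print(cannot_be_utf8(b"plain ASCII"))            # False: ASCII is valid UTF-8
print(cannot_be_utf8(b"Gr\xfc\xdfe"))            # True: Latin-1 bytes
print(cannot_be_utf8(b"Gr\xc3\xbc\xc3\x9fe"))    # False: same text in UTF-8
```

A False result only means "could be UTF-8". For pure 7-bit input it says nothing about which encoding the author had in mind.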

MfG Kai
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/