Re: Character set tagging considered harmful
Bram Moolenaar wrote on 1999-09-18 12:23 UTC:
> I wonder, is UCS-4 the maximum that is in use today?
UCS-4 is more than that.
The UCS-2 range is the maximum in use today. There are no characters yet
defined outside the range U+0000 to U+FFFD, which is known as "Plane 0"
(except the so-called Plane-14 tags, which are not really part of
Unicode). A plane is a 16-bit range with 2**16 code points. However,
there do exist plans to fill Plane 1 with scripts that are of historic,
cultural, hobbyist and scientific interest (Hieroglyphics, Tengwar,
Klingon, Blissymbolics, very exotic mathematical symbols, etc.). These
are characters that are not urgently needed (there exists very little
practice in encoding them on computers today if any at all), but it is
nice to have them covered at least in theory as well. There are also
plans to fill Plane 2 with thousands of historic CJK characters, to
cover all characters found in some very comprehensive Asian dictionaries
(again, also characters not used on computers today). So it is good to be
prepared for more than UCS-2.
UTF-16 is an extension of UCS-2 that uses a pair of 16-bit characters
from a high and low surrogate area in UCS-2 to represent characters in
planes 1 to 16 (U+010000 to U+10FFFF). UTF-16 can cover a bit over 1
million characters. It has been agreed between the Unicode consortium
and ISO that they will never standardize a character with a code >
U+10FFFF. So UTF-16 will be able to encode everything that will come in
the future. A code range of 1 million is commonly considered to be more
than good enough. Plenty of room for contact with
extraterrestrials ... ;-)
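The surrogate arithmetic described above can be sketched roughly as
follows. This is only an illustration, not code from any particular
library; the function names to_surrogates and from_surrogates are made
up for this example.

```python
def to_surrogates(cp):
    """Split a code point from planes 1 to 16 into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside planes 1 to 16")
    offset = cp - 0x10000                # 20-bit offset into planes 1..16
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

With 1024 high and 1024 low surrogate values, the pairs cover
1024 * 1024 = 1048576 code points beyond Plane 0, which is where the
"bit over 1 million" comes from.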
> I need to reserve space for each character, thus I
> would like to know if 4 bytes is enough.
4 bytes per character is *more* than enough. UCS is just a
31-bit character set after all, so a signed 32-bit int (that is what
glibc's wchar_t is) will more than do. Even 3 bytes will last forever
and 2 bytes would be OK so far if you are prepared to handle pairs of
UTF-16 surrogate values as single characters.
> The UTF-8 encoding might be longer, of course.
No. Better have another careful look at how UTF-8 really works:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
UTF-8 has no way of encoding characters more than 31 bits long.
A 32-bit integer will be able to hold the value of any legal
UTF-8 sequence.
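To make the 31-bit limit concrete, here is a minimal sketch of an
encoder for the 1 to 6 byte UTF-8 scheme as described in RFC 2279. The
name utf8_encode_31bit is invented for this example, and the code does
no checking for surrogates or other reserved values.

```python
def utf8_encode_31bit(value):
    """Encode an integer below 2**31 as a 1 to 6 byte UTF-8 sequence."""
    if not 0 <= value < 2**31:
        raise ValueError("UTF-8 cannot represent values of 2**31 or more")
    if value < 0x80:
        return bytes([value])            # plain ASCII, one byte
    # (sequence length, exclusive upper limit, lead-byte marker bits)
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for length, limit, marker in forms:
        if value < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))   # 6 payload bits per trailing byte
        value >>= 6
    # whatever bits remain fit into the lead byte next to its marker
    return bytes([marker | value] + tail[::-1])
```

The largest value, 0x7FFFFFFF, comes out as the six bytes
FD BF BF BF BF BF. There is no longer form, so a decoded value always
fits into a 32-bit integer.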
XFree86 xterm is restricted to the UCS-2 range by the way, as is the X11
font mechanism.
My advice would be to try and keep UTF-8 as the in-memory encoding. Do
not convert to a fixed-width encoding unless really necessary for
table-lookups, etc. The self-synchronizing properties of UTF-8 make this
very feasible. You can even preserve illegal UTF-8 sequences this way
such that you lose no information if you load and save a binary file
accidentally in UTF-8 mode. Mined98 is doing this nicely, as are a
number of other existing UTF-8 editors. The plan for emacs is also to
keep UTF-8 as the in-memory representation, in the interest of binary
transparency.
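As an illustration of why the self-synchronizing property makes this
workable, the sketch below finds the start of the character that
contains an arbitrary byte offset in a UTF-8 buffer. The helper names
are hypothetical and not taken from any of the editors mentioned. The
buffer itself is never rewritten, so malformed bytes stay exactly as
they were loaded.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


def char_start(buf, i):
    """Return the index of the first byte of the sequence containing buf[i].

    The scan steps back over at most 5 continuation bytes, the maximum
    in the longest (6-byte) form, so a malformed run of continuation
    bytes cannot make it walk far through the buffer.
    """
    start = i
    while start > 0 and i - start < 5 and is_continuation(buf[start]):
        start -= 1
    return start
```

For example, in the five bytes 61 E2 82 AC 62 ("a", the euro sign,
"b"), offsets 1, 2 and 3 all map back to offset 1, without decoding
anything before that position.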
> Are you saying that it's not possible to detect UTF-8 encoding reliably?
> Well, that's something that needs to be worked on!
LC_CTYPE is the best detector you will ever get. It allows us so far to
distinguish ISO_8859-15 from JISX0208, and I see no reason why it should
suddenly fail on UTF-8. Everything else is just a heuristic. The
self-synchronizing properties of UTF-8 make it more feasible to write a
heuristic that is right more than 95% of the time for UTF-8 than for
other encodings, but you should be careful to apply such autodetection
ONLY when the user didn't tell you explicitly via LC_CTYPE what the
intended encoding is. The user must be
able to reliably enforce interpretation of the file as UTF-8 for
mission-critical applications, where the remaining risk of autodetection
or tagging is not acceptable.
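The order of precedence argued for here could look roughly like the
sketch below: an explicit setting always wins, and the heuristic is
consulted only when nothing was declared. The function names and the
ISO 8859-1 fallback are assumptions made for the example, and
locale.nl_langinfo is only available on Unix-like systems.

```python
import locale


def locale_codeset():
    """Return the codeset named by the user's LC_CTYPE, or None.

    Note that setlocale() changes the LC_CTYPE of the whole process.
    """
    try:
        locale.setlocale(locale.LC_CTYPE, "")
        return locale.nl_langinfo(locale.CODESET)
    except (locale.Error, AttributeError):
        return None


def looks_like_utf8(data):
    """Heuristic only: True if the bytes form a valid UTF-8 stream."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pick_encoding(data, declared=None, fallback="iso-8859-1"):
    """Use the declared encoding if there is one; guess only otherwise."""
    if declared:
        return declared
    return "utf-8" if looks_like_utf8(data) else fallback
```

A caller would typically pass declared=locale_codeset(), so that the
guess is made only when the locale gives no answer.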
I assure you that UTF-8 files will not be tagged in any special way on
POSIX systems. Just like ASCII and ISO 646-Swedish files were never
tagged in any special way. Typed files are simply not the Unix way, for
very good reasons. There will be no BOM or ESC 2022 announcer, and if
there is one occasionally, it will either cause trouble or be lost after
the next cut & paste, grep, tail, conversion, etc. This stuff is not
robust in general. It might work in special restricted applications, but
not more. The world is already full of UTF-8 files. Search for UTF-8 on
dejanews, and you'll hit a hundred thousand postings, because Asian
versions of Netscape and IE have been sending out UTF-8 files for years.
> > We just want a toggle, between Mess and UTF-8.
>
> And we need to help the people that have to toggle all the time.
Exactly, by offering them an option to leave the error-prone toggling
and character-set guessing domain.
> Switching to a single encoding is not an option for most people at this time,
> since many files are Latin-1 encoded.
The files are really not the problem. Files are very easily converted
without loss of information. The problem is applications that
structurally cannot yet deal with files that can contain a million
different characters. Most applications assume that there are no
more than 256 characters. That is the real problem.
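To illustrate how little there is to the file conversion itself, here
is a minimal sketch for the Latin-1 case, with function names made up
for the example. Every Latin-1 byte maps to exactly one code point in
U+0000 to U+00FF, so the round trip is lossless.

```python
def latin1_to_utf8(data):
    """Re-encode ISO 8859-1 bytes as UTF-8; this cannot fail."""
    return data.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(data):
    """Convert back; raises UnicodeEncodeError for anything beyond U+00FF."""
    return data.decode("utf-8").encode("iso-8859-1")
```

Only the reverse direction can fail, and only for text that uses
characters Latin-1 never had.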
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/