Re: current idea
David Starner wrote:
>
> On Sun, Nov 04, 2001 at 06:05:15PM -0800, H. Peter Anvin wrote:
> > Unicode are *currently* committed to fitting within 20.1 bits -- but
> > they were equally committed to fitting within 16 bits :(
>
> It didn't take long for them to realize that 16 bits was not enough;
> Unicode 2.0 was out in July of '96. But it took them almost 5 years
> (until March 2001) to actually need to go outside 16 bits.
Yes, the handwriting on the wall for 16 bits was clear from the very
beginning, primarily due to CJK.
Unicode 1.0 (1991) assigned 34,001 code values (TUS3.0, p. 974), of
which 20,902 were Han characters, leaving 31,535 (65,536 - 34,001)
unassigned code values. (I use the 20,902 figure and do not count the
compatibility ones starting at U+F900.)
The _Kangxi Zidian_ (1716) dictionary from China, one of the four
dictionaries that were used to populate Unicode, contains 47,035
Han characters. 47,035 - 20,902 = 26,133 unencoded (as of 1991;
this is now all satisfied with Unicode 3.1).
Similarly, the _Dai Kanwa Jiten_ (1984 ed.) dictionary from Japan
contains 49,964 Han characters, and the _Hanyu Da Zidian_ (1986)
dictionary from China contains 54,678 Han characters.
The 1986 ed. of CNS 11643 (Taiwan) contained 26,539 Han characters.
Also, the CCCII character set from Taiwan, as of 1989, contained
75,684 Han characters. (Out of a possible space of 94^3 = 830,584
codepoints, although most will never be populated due to the
constraints on the relationship between what each codepoint is used for.)
Even if we assumed a lot of optimism about Han unification, 16 bits
wouldn't have been enough. (And even if one knew nothing about
character sets or 20th century dictionaries, the East Asian "man on
the street" knows of the famous 18th c. _Kangxi Zidian_--sort of like
a CJK Webster's or OED--and how it has about 50,000 Han characters.)
It gets slightly worse. The following year, the 1992 ed. of CNS 11643
(Taiwan) encodes 48,027 Han characters, and 600+ other characters.
Also, the _Zhonghua Zi Hai_ (1994) dictionary from China (some of its
characters are from the original version of Unicode) contains 85,568
Han characters.
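To make the arithmetic above easy to recheck, here is a rough sketch in
Python. It uses only the figures quoted in this message (none independently
verified), and it makes one loud simplifying assumption: that the 20,902
Han characters of Unicode 1.0 are a subset of each source, so that
"unencoded" is just the source's total minus 20,902. Real repertoires
overlap in messier ways, so treat the results as ballpark figures only.

```python
# Figures quoted in this message; the subset assumption below is a
# simplification made only for this illustration.
SPACE_16_BIT = 2 ** 16          # 65,536 code values
UNICODE_1_0_ASSIGNED = 34_001   # per TUS3.0, p. 974, as cited above
UNICODE_1_0_HAN = 20_902        # excluding the compatibility ones at U+F900

# Unassigned 16-bit code values after Unicode 1.0.
free_16bit = SPACE_16_BIT - UNICODE_1_0_ASSIGNED

han_sources = {
    "Kangxi Zidian (1716)": 47_035,
    "Dai Kanwa Jiten (1984 ed.)": 49_964,
    "Hanyu Da Zidian (1986)": 54_678,
    "CCCII (1989)": 75_684,
    "CNS 11643 (1992 ed.)": 48_027,
    "Zhonghua Zi Hai (1994)": 85_568,
}


def unencoded_after_1_0(total_han: int) -> int:
    """Han characters of a source not yet in Unicode 1.0 (subset assumption)."""
    return total_han - UNICODE_1_0_HAN


def fits_in_free(total_han: int) -> bool:
    """Would this one source's remainder fit in the leftover 16-bit space?"""
    return unencoded_after_1_0(total_han) <= free_16bit


if __name__ == "__main__":
    print(f"free 16-bit code values after Unicode 1.0: {free_16bit:,}")
    for name, total in han_sources.items():
        rest = unencoded_after_1_0(total)
        verdict = "fits" if fits_in_free(total) else "does not fit"
        print(f"{name}: {rest:,} unencoded, {verdict}")
```

Note that this checks each source on its own. It says nothing about the
union of several sources, nor about the space every other script would
also need.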
> Note that with every foreseen possible addition to Unicode, including
> Egyptian hieroglyphics and the like, there's still over 700 thousand
> free spaces for characters left in those 20.1 bits.
There aren't that many hieroglyphs anyway... :/
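For anyone wondering where "20.1 bits" comes from, a small sketch,
assuming the usual definition of the Unicode code space as 17 planes of
65,536 code points each (U+0000 through U+10FFFF). That definition is
not stated in this thread; the variable names are mine.

```python
import math

PLANES = 17            # plane 0 (the 16-bit space) plus 16 supplementary planes
PLANE_SIZE = 2 ** 16   # 65,536 code points per plane

code_space = PLANES * PLANE_SIZE    # total code points, U+0000..U+10FFFF
bits_needed = math.log2(code_space) # fractional number of bits to index them

print(f"{code_space:,} code points, about {bits_needed:.1f} bits")
```

Since 17 is just over 16 = 2^4, the total comes out just over 20 bits,
which is why the figure is quoted as 20.1 rather than 21.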
> > It's easy to chew up planes if you have to do something systematic...
>
> It can't be "idiosyncratic, personal, novel, rarely exchanged, or
> private-use", and it can't be a "[g]raphology unrelated to text".
As Peter pointed out, restrictions can change. I'll add to that the
fact that Braille and music notation used to be forbidden in Unicode,
too, although their impact on space has been relatively small.
> Within those restrictions, and the fact that they won't add hundreds of
> thousands of characters without very very good reason, what's going to
> be added to Unicode to take up 700 thousand characters?
Maybe something utterly idiotic and wasteful like encoding every
possible combination and arrangement of Han character components (as
was done for modern Korean jamo) so that new extensions don't have to
be added every so often? Or extreme disunification of Han
characters?--oh, let's say some 70-80,000 times each of the eight Asian
Han-character-using countries? Or some earth-shattering thing like
every person with Han characters in their names switching to unique
personal vanity characters and having them encoded?--some 2-3 billion
for Chinese names alone?
I really don't see it happening, either. :) Even if it does, it's not
our problem. If we start second-guessing the decisions made by the
creators of Unicode too much, then we might as well ignore all their
expertise and throw the whole thing out.
Thomas Chan
chan.200@xxxxxxx
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/