
RE: Proposal for 2 Byte Unicode implementation in gcc and glibc





> -----Original Message-----
> From: Ulrich Drepper [mailto:drepper@xxxxxxxxxx]
...
> Florian Weimer <fw@xxxxxxxxxxxxx> writes:
>
> > UTF-32 isn't fixed width either (think of combining characters).  To
> > be honest, I don't see your point here.
>
> Don't talk such a nonsense.  Combining characters and surrogates are
> not at all comparable.

There are, however, some similarities. E.g., a font may treat
letter+combining character in the same way as an automatically
formed ligature, and likewise treat high surrogate+low surrogate
in the same way as an automatically formed ligature.  This is what
OpenType fonts (via Uniscribe) and AAT fonts (natively) do, or will do.
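
To make the comparison concrete, here is a minimal sketch (Python,
standard library only) that counts code units per encoding for one
supplementary character and for one combining sequence.  The helper
name code_units and the two sample strings are my own choices for
illustration; they come from nowhere in this thread.

```python
def code_units(text: str, encoding: str) -> int:
    """Count the code units `text` occupies in a BOM-less UTF encoding."""
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size


# One code point outside the BMP (needs a surrogate pair in UTF-16).
supplementary = "\U00010400"  # DESERET CAPITAL LETTER LONG I
# Two code points: base letter followed by a combining mark.
combining_seq = "e\u0301"     # e + COMBINING ACUTE ACCENT

for name, text in (("supplementary", supplementary),
                   ("combining", combining_seq)):
    counts = {enc: code_units(text, enc)
              for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(name, counts)
```

The supplementary character takes two units only in UTF-16, while the
combining sequence takes more than one unit in every encoding form,
UTF-32 included.  Either way, code that wants "one displayed thing"
has to look at more than one unit.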

Aside: For many scripts, such as Arabic and all of the Indic
scripts, ligature formation is not optional; it must be done.

> The functions which have to handle character
> properties with wchar_t can and should expect precomposed input,
> something which is not at all possible with UTF-16.

???

Whether "precomposed input" (or, to be more precise, input in
Normalization Form C, where you in general WILL still find combining
characters!!) is used or not is *completely orthogonal* to the issue
of whether UTF-8, UTF-16, UTF-32, or even SCSU is used for the string
representation.
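
A small sketch of that orthogonality, using Python's standard
unicodedata module.  The helper nfc and the sample strings are
hypothetical names picked for illustration, and SCSU is left out
because the Python standard library has no codec for it.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return `text` in Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# e + COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
composable = nfc("e\u0301")
# q + COMBINING DOT ABOVE has no precomposed form, so NFC keeps both.
uncomposable = nfc("q\u0307")

print(len(composable), len(uncomposable))

# Normalization happens on code points; the encoding form chosen
# afterwards does not change the normalized sequence.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    for s in (composable, uncomposable):
        assert nfc(s.encode(enc).decode(enc)) == s
```

The second string is still two code points after normalization, so a
function that assumes "NFC means no combining characters" would be
wrong regardless of whether it reads UTF-8, UTF-16, or UTF-32.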

>  But why discussing
> all this?  There will be no first order 16bit UCS2/UTF-16 support.

???

Where will there not be "first order" UTF-16 support??  Java (also
on Linux) is targeted to use UTF-16 for the string datatype (it
already uses UCS-2, and will NOT begin to use UTF-32 for the string
datatype).  And it's not quite clear whether Ada's Wide_Character
and, in particular, C's wchar_t hold UCS-2 characters (bad idea),
UTF-16 code units (as in Win32), or UTF-32 characters (though they
may apparently sometimes hold something completely different, which
is also a bad idea).
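
For what it's worth, the width of wchar_t on a given platform can be
checked with a few lines of Python via ctypes.  This is only a sketch:
the names WCHAR_BYTES and wchar_units are hypothetical, and the code
ASSUMES that a 2-byte wchar_t holds UTF-16 code units and that any
other width holds one code point per element, which the C standard
does not promise.

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as seen by ctypes.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)


def wchar_units(text: str) -> int:
    """Estimate how many wchar_t elements `text` needs.

    Assumption (not guaranteed by C): 2-byte wchar_t means UTF-16
    code units; otherwise one element per code point.
    """
    if WCHAR_BYTES == 2:
        return len(text.encode("utf-16-le")) // 2
    return len(text)


print("sizeof(wchar_t) =", WCHAR_BYTES)
print("elements for U+10400:", wchar_units("\U00010400"))
```

Under those assumptions the same supplementary character needs two
elements where wchar_t is 2 bytes wide and one element elsewhere,
which is exactly why portable code cannot take the answer for granted.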

                Kind regards
                /kent k