Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
Jamie Lokier <egcs@xxxxxxxxxxxxxxxxxxxxxxxx> writes:
> Nuesser, Wilhelm wrote:
> > PS: When UTF-8 is used, the complexity of variable width characters
> > shows up with almost every commonly used language except pure 7-Bit
> > ASCII. For a number of languages, the UTF-8 representation saves some
> > storage when compared with UTF-16, but for Asian characters UTF-8
> > requires 50% more storage than UTF-16. We do not consider UTF-8 as
> > advantageous for text representation in the memory. It may be well
> > suited for files where access is sequential but in general it is no
> > universal solution.
[...]
> But I don't see the point in an extensive set of printfU16
> etc. functions. Standard unix text files use UTF-8 (or unfortunately
> they are often ISO-8859-1). Non-standard formats like databases may use
> UTF-16, but databases don't use printf to write to the database.
But if you have an application which needs to process huge amounts of
Unicode strings, UTF-16 is IMHO the best way:
You very seldom have surrogate pairs in normal strings, and on the other
hand you can detect these pairs with little overhead. We really cannot
afford either to blow up our memory requirements to 300% compared to
the ASCII case or to always scan the whole string for escape sequences,
which would probably increase the CPU overhead by the same amount.
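
To illustrate the point about surrogates, here is a minimal sketch in C.
The function names are made up for this example and are not part of gcc
or glibc. It assumes the string is held as an array of 16-bit code units
and does no validation beyond pairing a high surrogate with a following
low surrogate. The two small length helpers show the per-character
storage difference discussed above (3 bytes in UTF-8 versus 2 bytes in
UTF-16 for most Asian characters).

```c
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates occupy U+D800..U+DBFF: one mask and compare. */
int is_high_surrogate(uint16_t u) { return (u & 0xFC00u) == 0xD800u; }

/* Low (trail) surrogates occupy U+DC00..U+DFFF. */
int is_low_surrogate(uint16_t u) { return (u & 0xFC00u) == 0xDC00u; }

/* Count code points in a buffer of n UTF-16 code units.
 * A well-formed surrogate pair counts as one code point; an unpaired
 * surrogate is simply counted as one (this sketch does not validate). */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < n && is_low_surrogate(s[i + 1]))
            i++;            /* skip the trail unit of the pair */
        count++;
    }
    return count;
}

/* Bytes needed to encode one code point in UTF-8. */
size_t utf8_len(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Bytes needed to encode one code point in UTF-16. */
size_t utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}
```

For strings without surrogates the loop touches each code unit once and
the extra test is a single mask and compare per unit.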
(It took me a long time to understand this, and I had the very same
point of view as the one you expressed, but our NLS experts really
convinced me.)
Greetings
Christoph
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/