
Re: Unicode support under Linux



On Thu, Oct 04, 2001 at 10:45:41AM -0700, Carl W. Brown wrote:
> You are right that while functions like strstr will work with UTF-8, they are much slower.  strstr compares the matching string to the source byte by byte until a mismatch.  Then it increments the source by one byte.  If this byte is a continuation character there will be no hit.  This should not be too much of a problem since it should immediately mismatch.  It is a bit slower but not too bad.

A big hit, but I wonder how much is avoidable.  The three cases for this, I
think, are: a dumb strstr (which ends up comparing continuation bytes); a
strstr that knows UTF-8 (and avoids comparing those bytes); or converting to
UCS-2 or UCS-4 and doing a memcmp.

I think skipping continuations would be a speed hit--you'd be taking
the (minor) hit of UTF-8 decoding logic for every character, and all
you're saving is a few byte compares.  (Actually, a lot of byte
compares, but the dumb version is a lot less code.)
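
To make the second case concrete, here is a minimal sketch of a strstr that
skips continuation bytes.  The names utf8_strstr and is_continuation are made
up for illustration, and it assumes both strings are valid, NUL-terminated
UTF-8.  For valid input it should return the same pointer as plain strstr,
since a valid needle never starts with a continuation byte; the only
difference is the extra test per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: like strstr, but only attempts a match at the
 * first byte of each character.  Assumes valid, NUL-terminated UTF-8. */
static const char *utf8_strstr(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    size_t n = strlen(needle);
    if (n == 0)
        return haystack;
    for (const char *p = haystack; *p != '\0'; p++) {
        if (is_continuation((unsigned char)*p))
            continue;               /* a match cannot start here */
        if (strncmp(p, needle, n) == 0)
            return p;
    }
    return NULL;
}
```

Whether the skip is a win depends on how the per-byte test compares with the
cost of the immediate mismatch that the dumb version takes instead.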

> I don't think that the extra paging due to extra memory usage is too bad.  We get bigger and faster systems every day.

I disagree--realize that if you're dealing with English text and
converting to UCS-4, you're blowing the strings up to 4x their original
size.  This is less of an issue for most other languages, of course, where
UTF-8 is bigger, but you still are adding a very expensive strcpy
(essentially, anyway--the cost of copying and conversion) for every
comparison.
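
As a rough illustration of the size difference, here is a minimal UTF-8 to
UCS-4 decoding sketch.  utf8_to_ucs4 is a made-up name, not a library
function, and it does no validation: it assumes the input is valid,
NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into UCS-4.
 * Stores at most cap code points in out (out may be NULL to just count)
 * and returns the number of code points in the input.
 * No validation: malformed input yields meaningless code points. */
static size_t utf8_to_ucs4(const char *src, uint32_t *out, size_t cap)
{
    assert(src != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t count = 0;
    while (*s != '\0') {
        uint32_t cp;
        int extra;                  /* continuation bytes expected */
        if (*s < 0x80)      { cp = *s;        extra = 0; }
        else if (*s < 0xE0) { cp = *s & 0x1F; extra = 1; }
        else if (*s < 0xF0) { cp = *s & 0x0F; extra = 2; }
        else                { cp = *s & 0x07; extra = 3; }
        s++;
        /* Stop early at a non-continuation byte so we never run past NUL. */
        while (extra-- > 0 && (*s & 0xC0) == 0x80)
            cp = (cp << 6) | (*s++ & 0x3F);
        if (out != NULL && count < cap)
            out[count] = cp;
        count++;
    }
    return count;
}
```

With this, "hello" is 5 bytes as UTF-8 and 5 * 4 = 20 bytes as UCS-4, while a
two-character Japanese string is 6 bytes as UTF-8 and 8 bytes as UCS-4.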

In general, I don't think it's a good idea to discard the idea of
keeping code quick--and in the case of fundamental string operations,
it's still very important.  (I want faster systems to mean my programs
run *faster*; I don't want to break even. :)

> If you use wide character support you have to use it everywhere.  You cannot convert a string from UTF-8 to UTF-32 and tokenize it with wcstok and expect that the results be mapped back to the original UTF-8 string.  You have to go WC all the way.  That means a lot of program constants that will also have to be changed.  With UTF-8 you don't have to change any constants that are pure ASCII.
> The big hit comes with debugging.  It is a pain to read the UTF-32 strings.  This really increases the development cost especially with non-i18n programmers who don't keep a copy of the Unicode book on their desk at all times.

This leaves many people preferring UTF-8--which leaves us with the
situation above (and likewise for all string ops).

Looking at the three major choices (UTF-8, UCS-2 or UCS-4, and DBCS), all
seem to have major pitfalls right now.  UTF-8 leaves us with slow string
ops.  UCS-2 and UCS-4 leave us with more memory usage and much harder
debugging, and mean you're doing conversions for all I/O.  DBCS leaves us
with an unknown string type (the program doesn't know anything about it).
Both UCS-2/4 and UTF-8 mean you're doing a bunch of conversion if you want
full MBCS support, and MBCS means iterating in reverse is slow.

I'd probably take the C++ route, if I wanted full MBCS support:
keep strings in their native format whenever possible, convert if needed
for speed, special case stuff like reverse iterators for UTF-8.  This
isn't really appropriate for small projects, though.  Too bad C++'s own
string class sucks.
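
For what it's worth, the reverse-iteration special case for UTF-8 is small,
because continuation bytes are recognizable on their own.  A minimal sketch
(utf8_prev is a made-up name; it assumes valid UTF-8 and that p points
somewhere past start, on a character boundary):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given p inside (or one past the end of) a valid
 * UTF-8 string beginning at start, with start < p, return a pointer to
 * the first byte of the character that ends just before p. */
static const char *utf8_prev(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && start < p);
    do {
        p--;                        /* step back one byte ... */
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;                       /* ... until a non-continuation byte */
}
```

Starting from the end of the string and calling this repeatedly walks the
characters backwards.  This trick does not carry over to legacy multibyte
encodings, where a trailing byte can look like a lead byte.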

> You can do that with xIUA and ICU.  In fact you might want to use the same sort of support with glibc.  That way if you want to go to ICU later or port to another platform you only have one piece of code to change.
> xIUA supports different encodings dynamically.  You can have a routine that gets called with EUC-JP, UTF-8 or UTF-32 data and they all are handled correctly.  You can also invoke the UTF-8 support explicitly and save the overhead of checking to see which routine to call.  If you are communicating with browsers, for example, they don't all support UTF-8 properly.  It even has a bonus for HTML and XML in that you can tell the converter to automatically convert any character that does not convert into an NCR sequence.  This way you can send Japanese with the iso-8859-1 code page and not lose a character.

Sounds like the above.  Sounds like it might be fairly big, too, which
is annoying for most small- and medium-sized projects--most OSS
developers are very hesitant to add a major dependency, especially the
cross-platform ones (where this will often mean shipping binary packages
along with the runtimes for the dependency).

-- 
Glenn Maynard
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/