Re: Unicode support under Linux
Hi,
At Wed, 03 Oct 2001 15:45:31 -0400,
Richard, Francois M <Francois.M.Richard@xxxxxxxxxxxxx> wrote:
> But, is it also true to say that under Linux utf-8 Locales, all C functions
> handle properly char data representing utf-8 character encoded data? Do
> strlen, strchr, strcmp, strcpy, toupper process char data correctly when the
> Locale character encoding is utf-8? OR I need to use the wide character
> functions after specific conversion from char to wchar_t of my character
> data?
Not perfectly.
* strlen
strlen counts the *number of bytes* in the given string, not the
*number of characters*. Since UTF-8 is a multibyte encoding, these
two do not coincide once the string contains non-ASCII characters.
* strcpy
works well.
* strchr
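does not work for non-ASCII characters, because a multibyte UTF-8
character cannot be expressed as a single 'char'. It still works
when searching for an ASCII character, since ASCII bytes never occur
inside a multibyte UTF-8 sequence.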
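To make the strlen and strchr points concrete, here is a minimal
sketch that is specific to UTF-8 (it is not the locale-independent
approach and assumes the input is valid UTF-8). The function names
utf8_char_count and utf8_find_char are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Count characters (code points) in a UTF-8 string by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx.
   Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Find one UTF-8 character, given as its own byte sequence, inside a
   UTF-8 string. strchr cannot do this because it takes a single char
   value; strstr compares the whole byte sequence instead. */
const char *utf8_find_char(const char *haystack, const char *utf8_char)
{
    return strstr(haystack, utf8_char);
}
```

For "caf\xC3\xA9" (the word "café" in UTF-8), strlen returns 5 while
utf8_char_count returns 4, and utf8_find_char(s, "\xC3\xA9") points at
the fourth character.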
I think the simplest way to replace all these functions is to use
wide characters. The standard C library has wchar_t counterparts of
the above functions (wcslen, wcscpy, wcschr, and so on), and there are
conversion functions between "multibyte character" and "wide
character" strings (mbstowcs, wcstombs, and so on). Note that
"multibyte character" does not mean that every character is actually
multibyte. It means "the locale-dependent encoding". In an ISO-8859-1
locale, the "multibyte character" encoding is ISO-8859-1. In a Big5
locale, it is Big5. I.e., if you write your software using "multibyte
character" and "wide character", your software will support not only
UTF-8 but also all major encodings in the world such as ISO-8859-*,
EUC-*, KOI8-*, and so on.
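As a rough sketch of that conversion step, the helper below turns a
string in the current locale's encoding into a wide string using the
standard mbsrtowcs function. The name mb_to_wide is hypothetical, and
the sketch assumes the program has already called
setlocale(LC_CTYPE, "") so that the user's locale is in effect.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a string in the current locale's encoding into a newly
   allocated wide string. Returns NULL if the input is not valid in
   that encoding or if allocation fails. The caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
    mbstate_t st;
    const char *p = s;
    size_t n;
    wchar_t *w;

    /* First pass: with a NULL destination, mbsrtowcs only reports how
       many wide characters the conversion would produce. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: convert for real, including the terminating L'\0'. */
    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &p, n + 1, &st);
    return w;
}
```

After the conversion, wcslen counts characters and wcschr can search
for any single character, whatever the locale's encoding is.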
An explanation of the wchar_t functions is available in my document,
which is linked from my signature at the bottom of this mail.
Note that wchar_t is not always UTF-32, although it always is in
GNU libc. If you have to write portable software, you must not assume
that wchar_t is UTF-32.
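One way to check this at compile time is the __STDC_ISO_10646__ macro
from C99, which an implementation defines when wchar_t values are
ISO 10646 (Unicode) code points. The function name below is only
illustrative.

```c
/* Report whether the implementation promises that wchar_t values are
   ISO 10646 (Unicode) code points. Returns 1 if so, 0 if the encoding
   of wchar_t is unspecified and must not be assumed. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

Portable code can use this to decide whether it may treat a wchar_t
value as a Unicode code point or must go through the conversion
functions instead.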
---
Tomohiro KUBOTA <kubota@xxxxxxxxxx>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/