Re: Transliteration for use in UTF-8 locales
Jean-Marc Desperrier wrote on 2000-10-16 08:55 UTC:
> Then the application will need a quite more sophisticated version of wcwidth
> than the one you provide freely ...
The wcwidth() on
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
is merely intended to document what xterm is doing in UTF-8 mode and
what other UTF-8 terminal emulators are recommended to do as well. If
you use a C library to do your wide character to UTF-8 conversion, then
you definitely should also use that same C library's wcwidth() function
and not mine. This way, wcwidth() information will come from the same
source as the transliteration, which allows transliteration to be taken
into account.
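
As a minimal sketch of what "use the C library's wcwidth()" means for an
application: the helper below (display_width is a hypothetical name, not
part of any library) just sums the libc wcwidth() values over a wide
string. It assumes a POSIX/XSI system where <wchar.h> declares wcwidth();
the standard wcswidth() does much the same job.

```c
#define _XOPEN_SOURCE 700   /* asks <wchar.h> to declare wcwidth() */
#include <stddef.h>
#include <wchar.h>

/*
 * display_width() is a hypothetical helper name.
 * It returns the number of terminal columns the C library says the
 * string occupies in the current locale, or -1 if wcwidth() reports
 * that some character has no defined width (e.g. a control character).
 */
int display_width(const wchar_t *s)
{
    int total = 0;

    for (; *s != L'\0'; s++) {
        int w = wcwidth(*s);   /* width as seen by the C library */
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

An application would call setlocale(LC_CTYPE, "") once at startup so that
both the conversion functions and wcwidth() refer to the same locale.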
There is one (minor?) open problem though: What happens if xterm one day
also uses the C library's wcwidth() instead of the current hardwired
one, and it does this in a UTF-8 locale with transliteration? A few
strange effects might theoretically happen, but I am not yet sure
whether they would show up in practice. For example, the C library would
provide you with wcwidth(L'ü') == 2, because of the ü -> ue
transliteration. This would naturally cause xterm to pick ü from the
wide character set, which is not desirable. But then, in that locale,
xterm should never see the ü directly, because it would be
transliterated into other characters anyway. ("ü" is perhaps not a
realistic example, because it is already Latin and unlikely to be
transliterated in a UTF-8 locale, but substitute for it some Cyrillic or
Ethiopic character that an EU/US user might well want to have
transliterated for readability.)
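
To make the possible disagreement concrete, here is a toy sketch. Both
width functions are invented for illustration: hardwired_width() stands
in for a terminal's built-in table and translit_width() for a
hypothetical libc that reports width 2 for ü because of a ü -> ue
transliteration. Neither reflects the real xterm or glibc tables.

```c
#include <stddef.h>
#include <wchar.h>

typedef int (*width_fn)(wchar_t);

/* Hypothetical stand-in for a terminal's hardwired table:
 * everything is single-width in this toy model. */
int hardwired_width(wchar_t c)
{
    (void)c;
    return 1;
}

/* Hypothetical stand-in for a locale that transliterates
 * U+00FC (u with diaeresis) into two characters. */
int translit_width(wchar_t c)
{
    return c == 0x00FC ? 2 : 1;
}

/* Index of the first character on which the two width sources
 * disagree, or -1 if they agree on the whole string. */
int first_width_mismatch(const wchar_t *s, width_fn a, width_fn b)
{
    int i;

    for (i = 0; s[i] != L'\0'; i++) {
        if (a(s[i]) != b(s[i]))
            return i;
    }
    return -1;
}
```

From the first mismatching index onwards, an application and a terminal
that use different width sources would disagree about the cursor column.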
Would it perhaps be conceptually less confusing if we agreed that xterm
should in the future not base the single-width/double-width decision for
each character on the locale-dependent wcwidth(), but only on its
hardwired built-in table?
My original idea of eventually using a locale-dependent wcwidth() in
xterm was to allow for JIS/EUC compatible UTF-8 locales, where for
example all Greek and Cyrillic letters are double-width, as they are
today in Japan on EUC terminals like kterm (because they are encoded as
double bytes). Are locales that are width-compatible with EUC legacy
codings really needed and desirable in practice? I personally find them
quite ugly and undesirable, and I would now favour simply keeping
xterm's wcwidth hardwired and locale-independent, so that libc's
transliteration cannot interfere with it.
Another practical problem that speaks against locale dependency in
xterm is that you have to start a new xterm each time you want to
change the locale. An "export LANG" typed into the shell inside an
xterm will not affect the xterm itself any more, only subprocesses of
that shell. So IMHO we had better keep the width hardwired in xterm, as
it is at the moment.
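
A related point can be sketched in a few lines of C: a process picks up
its locale when it calls setlocale(), and later changes to LANG do not
alter the locale already in effect. The function name below is
hypothetical, and "ja_JP.eucJP" is only an example value that need not be
installed. This is not the same mechanism as the shell case above (there
the change never reaches the parent xterm at all), but it shows why a
running program keeps the locale it started with.

```c
#define _POSIX_C_SOURCE 200809L   /* for setenv() */
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the LC_CTYPE locale of this
 * process is the same before and after LANG is changed in its
 * environment, 0 otherwise (or on allocation/setenv failure).
 */
int locale_unchanged_after_setenv(void)
{
    const char *name;
    char *before;
    int same;

    /* Read the locale from the environment once, as a program
     * typically does at startup. If this fails, the previous
     * locale simply stays in effect. */
    setlocale(LC_CTYPE, "");

    /* Passing NULL only queries the current locale name. */
    name = setlocale(LC_CTYPE, NULL);
    if (name == NULL)
        return 0;
    before = malloc(strlen(name) + 1);
    if (before == NULL)
        return 0;
    strcpy(before, name);   /* copy: later calls may reuse the buffer */

    /* Change LANG afterwards; nothing re-reads it automatically. */
    if (setenv("LANG", "ja_JP.eucJP", 1) != 0) {
        free(before);
        return 0;
    }

    name = setlocale(LC_CTYPE, NULL);
    same = (name != NULL && strcmp(before, name) == 0);
    free(before);
    return same;
}
```

Only another explicit setlocale(LC_CTYPE, "") call would make the process
look at the changed environment.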
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/