
Re: Transliteration for use in UTF-8 locales



Jean-Marc Desperrier wrote on 2000-10-12 10:55 UTC:
> Thinking about that, I realized it is very unlikely that someone has a system
> that includes these transliteration tables, knows how to use them
> properly, and yet lacks adequate fonts.

Realistic application example: You have a modern Linux-based university
library mainframe whose bibliographic database has recently been
converted completely to Unicode. All over the university, however,
thousands of cheap mid-1980s VT100 terminals are still set up so that
users in local institute libraries can query the database. They are not
going to be replaced any time soon, because they fulfill their purpose
nicely for 98% of all users. Intensive users of non-ASCII scripts will
use modern PCs with a UTF-8 terminal emulator to telnet into the library
system, but for the occasional search hit from an ASCII terminal on a
non-Latin title, it would be nice to have a transcription written out.


> Therefore I see two most likely situations for the use of this mechanism:
> - you receive a text that holds some characters that do not belong to your
> usual environment, and you _do not bother_ to install the proper fonts for it, or
> _you are not able_ to interpret the characters when they are displayed in
> their native form. In either case, this is a strong indication that these
> characters do not belong to a language you are used to working with. This means
> that LC_CTYPE will _not_ be configured to select a proper transliteration
> method.

glibc 2.1.94 already applies transliteration today in the default C
locale to all non-ASCII wide characters. Most of them come out as
question marks, but even without any locale setting,

  wprintf(L"Schöne Grüße!\n");

will spit out

  Schoene Gruesse!

which Germans will find far more readable on ASCII terminals than
the

  Schne Gre!

or the aborted wprintf and errno=EILSEQ that you would get on systems
without transliteration.

> In that case, it means that if transliteration is only applied when
> LC_CTYPE has a very specific value, the usefulness of this system will be quite
> limited.

If no locale is set, the C locale will be used. It is a matter of
personal taste whether C is a synonym for C.ASCII, C.ISO8859-1, or
C.UTF-8 (the C standard doesn't say anything on this for good reasons),
but in the first two cases transliteration will be applied by glibc for all
characters not found in the specified external encoding.
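
Which of these cases applies on a given system can be checked at run
time. A minimal sketch, assuming a POSIX system that provides
nl_langinfo(); the function names ctype_codeset() and ctype_is_utf8()
are made up for this illustration:

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Name of the character encoding of the current LC_CTYPE locale.
 * Before any setlocale() call this is the "C" locale; the name reported
 * for it differs between C libraries. */
const char *ctype_codeset(void)
{
    const char *name = nl_langinfo(CODESET);
    assert(name != NULL);   /* CODESET is a valid item, so a string is expected */
    return name;
}

/* Nonzero if the current locale encoding is UTF-8, in which case every
 * wide character can be written out and no transliteration is needed. */
int ctype_is_utf8(void)
{
    return strcmp(ctype_codeset(), "UTF-8") == 0;
}
```

In a program that never calls setlocale(), this reports the encoding
behind the C locale.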

> On the other hand, applying transliteration without knowing if the
> user wants it, is dangerous.

Aborting a wprintf() with errno=EILSEQ is potentially far more
dangerous than that (because hardly anyone checks the return value, and
it is unclear how the software should react even if it does check), so
I think that transliteration is a nice example of smooth degradation
and robust engineering.
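
For software that does want to check, one possible reaction is sketched
below: convert character by character with wcrtomb() and substitute a
"?" whenever the conversion fails, rather than giving up on the whole
string. The function name wcs_to_locale_lossy() is hypothetical, and
whether a given character fails, or is transliterated by the C library
first, depends on the implementation and the locale.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Convert ws to the current locale's encoding, writing at most
 * outsz - 1 bytes plus a terminating NUL to out. Characters that the
 * C library cannot convert are replaced by '?' instead of aborting.
 * Returns the number of characters that had to be replaced.
 * Simplification: no final shift sequence is emitted, which a stateful
 * encoding might need. */
size_t wcs_to_locale_lossy(const wchar_t *ws, char *out, size_t outsz)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];
    size_t used = 0, replaced = 0;

    assert(ws != NULL && out != NULL && outsz > 0);
    memset(&st, 0, sizeof st);

    for (; *ws != L'\0'; ws++) {
        size_t n = wcrtomb(buf, *ws, &st);
        if (n == (size_t)-1) {          /* conversion failed (EILSEQ) */
            memset(&st, 0, sizeof st);  /* state is undefined after an error */
            buf[0] = '?';
            n = 1;
            replaced++;
        }
        if (used + n >= outsz)          /* no room left: truncate */
            break;
        memcpy(out + used, buf, n);
        used += n;
    }
    out[used] = '\0';
    return replaced;
}
```

A nonzero return value tells the caller that the output is only an
approximation of the input.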


> In some cases, we need a "several to several" conversion for the
> transliteration of Japanese.

You are probably referring to context-sensitive transliteration. No, we
definitely don't want to do that in the C library. It would cause
endless implementation problems (efficiency, buffering, etc.), and it is
not worth the trouble. As you said, transliteration is just an emergency
mechanism for smooth degradation of the system's performance if it
reaches its limits. We just want to be slightly more convenient than
transliterating all unavailable characters into a "?". If you are not
happy with what primitive context-free transliteration in wcsrtombs()
gives you, then simply use a Unicode locale in order to avoid
transliteration entirely.
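
To illustrate what context-free means here, the following sketch maps
each wide character on its own to a fixed ASCII string, never looking
at its neighbours. The table and the names translit_table,
translit_char() and translit_to_ascii() are invented for this example
and are not glibc's actual data or interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Illustrative table only; it is not the table glibc ships. */
struct translit_entry { wchar_t from; const char *to; };

static const struct translit_entry translit_table[] = {
    { 0x00C4, "Ae" }, { 0x00D6, "Oe" }, { 0x00DC, "Ue" },
    { 0x00E4, "ae" }, { 0x00F6, "oe" }, { 0x00FC, "ue" },
    { 0x00DF, "ss" },
};

/* Context-free lookup: the result depends on wc alone.
 * Characters without a table entry degrade to "?". */
static const char *translit_char(wchar_t wc)
{
    size_t i;
    for (i = 0; i < sizeof translit_table / sizeof translit_table[0]; i++)
        if (translit_table[i].from == wc)
            return translit_table[i].to;
    return "?";
}

/* Write an ASCII rendering of ws to out (at most outsz - 1 bytes plus
 * a terminating NUL); output is truncated if it does not fit. */
void translit_to_ascii(const wchar_t *ws, char *out, size_t outsz)
{
    size_t used = 0;

    assert(ws != NULL && out != NULL && outsz > 0);
    for (; *ws != L'\0'; ws++) {
        char ascii[2] = { 0, 0 };
        const char *rep;
        size_t n;

        if ((unsigned long)*ws < 0x80) {   /* ASCII passes through */
            ascii[0] = (char)*ws;
            rep = ascii;
        } else {
            rep = translit_char(*ws);
        }
        n = strlen(rep);
        if (used + n >= outsz)
            break;
        memcpy(out + used, rep, n);
        used += n;
    }
    out[used] = '\0';
}
```

With this table, translit_to_ascii() turns L"Schöne Grüße!" into
"Schoene Gruesse!", and any character without an entry becomes "?".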

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/