
Re: Transliteration for use in UTF-8 locales



Jean-Marc Desperrier wrote on 2000-10-10 15:48 UTC:
> Will iconv automatically use that mechanism?

I haven't looked into iconv yet and don't know whether it can do
transliterations. Iconv is not locale-dependent, so the use of
transliteration would have to be specified somehow in the encoding
selection strings supplied to it, and I don't know of any standard
convention for that. We probably should start thinking about defining
one.

If transliteration is defined in the locale, then it must be applied by
all C library functions that convert wide characters into multi-byte
characters as specified by LC_CTYPE. These include

  a) string conversion functions such as wcsrtombs()
  b) wide output functions such as fputws() and wprintf()
  c) byte output functions such as printf() if the format specifier
     is %lc or %ls to signal that a wide character or wide string
     is the argument
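
To check whether paths a) and c) agree on a given system, something like
the following sketch could be used. The function names are hypothetical,
and the caller is assumed to have selected a locale already (the default
"C" locale applies otherwise). It only compares the two results; it makes
no claim about what any particular libc version does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Path a): convert with wcsrtombs(). Returns 0 on success, -1 if a
   character cannot be converted or the buffer is too small. */
int convert_with_wcsrtombs(const wchar_t *ws, char *out, size_t outsize)
{
    mbstate_t state;
    const wchar_t *src = ws;
    size_t n;

    assert(ws != NULL && out != NULL);
    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    n = wcsrtombs(out, &src, outsize, &state);
    /* src is set to NULL only if the terminating null was converted */
    if (n == (size_t)-1 || src != NULL)
        return -1;
    return 0;
}

/* Path c): convert with a byte output function and the %ls specifier.
   Returns 0 on success, -1 on conversion error or truncation. */
int convert_with_printf(const wchar_t *ws, char *out, size_t outsize)
{
    int n;

    assert(ws != NULL && out != NULL);
    n = snprintf(out, outsize, "%ls", ws);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return 0;
}

/* Returns 1 if both paths yield the same bytes, 0 if they differ,
   and -1 if either conversion fails. */
int conversions_agree(const wchar_t *ws)
{
    char a[256], c[256];

    if (convert_with_wcsrtombs(ws, a, sizeof a) != 0)
        return -1;
    if (convert_with_printf(ws, c, sizeof c) != 0)
        return -1;
    return strcmp(a, c) == 0;
}
```

Path b) is left out here because a stream cannot be used for both byte
and wide output; testing it needs a separate stream.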

As of glibc 2.1.94, b) already works nicely, but a) and c) are still not
implemented. When I reported that a) and c) don't work, glibc maintainer
Ulrich Drepper replied, to my surprise, that in his opinion this is not
required by the standard. I then quoted chapter and verse of ISO C99,
which very clearly says the opposite (namely that conversion has to
be done identically in all three cases above, i.e. each time with or
each time without transliteration), and also pointed out the severe
technical problems that an inconsistent application of transliteration
would cause, but I haven't heard back from him yet on whether he agrees
now.

> How many transliteration tables will be included in standard ?

We'll see. I am putting together a big default table, and I will
probably also soon have to add a number of language-specific extensions
that people then can merge in easily. These extension tables might be
necessary to handle the different, often language and culture dependent
transliteration conventions. For example the dotted consonants are ASCII
transliterated in Ireland by putting an h after the base character, but
the same dotted consonants are also used in Chinese romanization
schemes, where base character + h might come somewhat as a surprise to
users. Similarly, Greek users and mathematicians can have very different
requirements for transliterating the same letter, so there definitely
will never be one single transliteration table that can make everyone
happy.

Whether any of these tables, and if so how much of them, will get into
glibc 2.1.96 remains to be seen. At the moment, glibc contains an example
locale called "i18n" that is listed in ISO draft TR 14652. This example
locale contains a tiny example transliteration table that covers only a
few Danish and German characters and is not at all useful as a generic
starting point.

It is IMHO not that essential whether the transliteration tables come
as standard with the system distribution, because you can always very
easily cook up your own, without getting root involved. However, it
is essential that the transliteration mechanism is properly
supported in the C library code that comes with the system, because the
C library is quite non-trivial to upgrade and install later (at least
under Linux).

> What is the method to choose the kind of transliteration, create new
> methods ?

The method to choose transliteration is to choose the LC_CTYPE component
of your locale. In other words, the same technique you use to choose the
multi-byte encoding. Transliteration is really nothing but an add-on to
the multi-byte encoding, however one that only affects output, not
input.

So in bash, you just prefix your program invocations with an LC_CTYPE
or LANG assignment, as in

$ LANG=de_DE.ASCII ./test
Schoene Gruesse!
$ LANG=de_DE.ISO-8859-1 ./test
Schöne Grüße!

and you can switch between different transliteration conventions for
locale-dependent multi-byte output.
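
A program like ./test needs little more than a setlocale() call and a
wide output function. The following is only a sketch of the two building
blocks; the function names are hypothetical and this is not the actual
source of the test program used above.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Adopt LC_CTYPE from the environment (LC_ALL, LC_CTYPE or LANG).
   Returns the selected locale name, or NULL if it is unavailable. */
const char *use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "");
}

/* Write a wide string through the wide output path b), so that the
   conversion to multi-byte output is governed by LC_CTYPE.
   Returns 0 on success, -1 on failure. */
int emit_wide(FILE *fp, const wchar_t *ws)
{
    assert(fp != NULL && ws != NULL);
    return fputws(ws, fp) < 0 ? -1 : 0;
}
```

A main() would call use_environment_ctype() once at startup and then
emit_wide(stdout, L"Sch\u00f6ne Gr\u00fc\u00dfe!\n"). Whether the output
is transliterated then depends on the locale selected in the
environment and on libc support for transliteration in that path.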

Locales are compiled files residing usually in a system-wide directory
/usr/share/locale, but you can also easily compile your own locale
definition file using the localedef tool. I did this for en_GB.UTF-8
with

  localedef -v -c -i en_GB -f UTF-8 $HOME/local/locale/en_GB.UTF-8

and then in order to inform libc about where it can find my new
home-made locale definition files, I had to use

  export LOCPATH=$HOME/local/locale

See also: man localedef
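
To confirm that libc actually finds a home-made locale, a program can try
to select it and ask for its codeset. This is a minimal sketch with a
hypothetical helper name; it assumes a POSIX system that provides
nl_langinfo(CODESET).

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Select LC_CTYPE by name and return the codeset libc reports for it,
   or NULL if the locale cannot be found (e.g. LOCPATH is not set). */
const char *codeset_of(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

With LOCPATH exported as above, codeset_of("en_GB.UTF-8") should return
a non-NULL codeset name if the compiled locale was picked up.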

> For example in the case of Japanese kana, two different transliteration
> methods to the Latin alphabet are commonly used, Hepburn and the Japanese
> official method.

References are welcome. I might get them both added to my collection as
separate files, and out of these, you then can cook together your own
transliteration locale tailored according to your personal preferences.

Hope that helped ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/