RE: Transliteration for use in UTF-8 locales
One other use of the transliteration scheme is for broken spelling
systems. I am exploring one use of it for Irish, using glibc-2.1.95.
Old Irish used to use some consonants ([bcdfmpst] with dot above)
that were not used in other alphabets. As part of a spelling reform
in the 1950s these were replaced with [letter] followed by h, as in
(b with dot above) --> bh, etc.
This made it easier to find fonts, typewriters, etc. for Irish, at the
expense of making words longer and less clear. However the old characters
are still used in places, and are now present in Unicode and ISO 8859-14,
and the Minimum European Subset of characters, so you are guaranteed to
find them in many fonts.
So it is straightforward to define transliteration schemes in locales
such that a user can use the old scheme or the new scheme at will.
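
As a rough illustration of the old-to-new direction of such a scheme (outside
glibc, and not how the locale mechanism itself is implemented), here is a
minimal Python sketch. The function name modernize_irish is hypothetical, and
the letter set is only the [bcdfmpst] group mentioned above.

```python
import re
import unicodedata

# Letters taken from the message above: [bcdfmpst] with dot above.
# The set is an assumption of this sketch and can be extended.
DOTTED_BASES = "bcdfmpstBCDFMPST"
COMBINING_DOT_ABOVE = "\u0307"

# Matches a base consonant immediately followed by a combining dot above.
_PATTERN = re.compile("([" + DOTTED_BASES + "])" + COMBINING_DOT_ABOVE)


def modernize_irish(text: str) -> str:
    """Rewrite dotted consonants as consonant + h (old to new spelling)."""
    # NFD splits e.g. U+010B into 'c' + U+0307, so one pattern covers
    # both precomposed and decomposed input.
    decomposed = unicodedata.normalize("NFD", text)
    replaced = _PATTERN.sub(r"\1h", decomposed)
    # Recompose whatever is left (for example accented vowels).
    return unicodedata.normalize("NFC", replaced)


if __name__ == "__main__":
    print(modernize_irish("bua\u010baill"))  # prints "buachaill"
```

The reverse direction (bh back to dotted b) is not a plain string substitution,
since an h after one of these consonants is not always a replacement for an
old dot, so this sketch does not attempt it.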
Alastair McKinstry - <mckinstry@xxxxxxxxxxxx>
Crom Lar, Ballinahalla, Maigh Cuilinn, Gaillimh, Ireland
Phone: +353 91 556177, Mobile/Fax +353 87 6847928
-----Original Message-----
From: Markus Kuhn [mailto:Markus.Kuhn@xxxxxxxxxxxx]
Sent: Tuesday, October 10, 2000 2:03 PM
To: linux-utf8@xxxxxxxxxxxx
Subject: Transliteration for use in UTF-8 locales
The ISO TR 14652 transliteration mechanism (which is already partly
implemented in glibc 2.1.95) was probably primarily intended for people
whose I/O devices cannot handle UTF-8 and who want to see as much as
possible of the wide character information in their 7/8-bit coding
system. If you have a UTF-8 locale, every wide character can uniquely
and without loss of information be converted into a multi-byte
character, and no transliteration seems necessary at first sight.
While doing some research for the transliteration tables that I am
currently putting together, it occurred to me that there is a second,
quite good reason for using transliteration. Even if people work in a UTF-8
locale with fully Unicode capable I/O devices everywhere, their brain
might still not yet be fully Unicode capable and they might still want
library-level transliteration to aid in reading the text. I would find
it very convenient to have a
de_DE.UTF-8@romanized
locale, that uses the UTF-8 encoding, but nevertheless applies
transliteration (optimized for a German reader) to non-Latin scripts.
This way, if, say, Russian, Greek, Hebrew or Arabic people write their
names in the From: lines of email headers in their native script, I will still
be able to get a romanized display that helps me to guess the
pronunciation of their names reasonably well. This has nothing to do
with converting to ASCII. In fact, many of the ISO standardized
transliteration schemes add lots of accents to the romanized output of
a transliteration, in order to minimize the loss of information.
The UTF-8 output of romanized Greek or Cyrillic text will typically
contain lots of Latin characters not found in ISO 8859 or ISO 6937.
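
To make the point concrete, here is a small Python sketch of romanization
that stays in UTF-8 rather than falling back to ASCII. The table is a tiny
illustrative subset in the style of scholarly Cyrillic romanization, not a
complete or authoritative scheme, and the names ROMANIZE and romanize are
hypothetical; nothing here reflects how glibc implements transliteration.

```python
# Illustrative subset only: just enough letters for the example below.
# A real table would cover the whole script and follow a chosen standard.
ROMANIZE = {
    "П": "P",
    "п": "p",
    "у": "u",
    "ш": "š",  # accented Latin letter, not representable in ASCII
    "к": "k",
    "и": "i",
    "н": "n",
}


def romanize(text: str) -> str:
    """Replace characters found in the table; pass everything else through."""
    return "".join(ROMANIZE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    name = romanize("Пушкин")
    print(name)                  # prints "Puškin"
    print(name.isascii())        # prints "False": the output is not ASCII
    print(name.encode("utf-8"))  # still a valid UTF-8 byte sequence
```

Characters that are already Latin pass through unchanged, so text in a mixed
header line would only have its non-Latin parts rewritten.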
Just something that people playing around with glibc transliteration
might keep in mind.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/