Re: Transliteration for use in UTF-8 locales
Markus Kuhn wrote:
> Intensive users of non-ASCII scripts will
> use modern PCs with a UTF-8 terminal emulator to telnet into the library
> system, but for the occasional search hit from an ASCII terminal on a non-Latin
> title, it would be nice to have a transcription written out.
This is similar to the first case I described.
> Transliteration is today already applied in glibc 2.1.94 in the C default
> locale for all non-ASCII wide characters.
OK. I didn't understand that.
> Aborting a wprintf() with errno=EILSEQ is potentially even far more
> dangerous than that (because hardly anyone checks and it is even unclear
> how the software should react if this is checked at all), so I think
> that transliteration is a nice example of smooth degradation and robust
> engineering.
I was thinking of the case you spoke about earlier: the locale is UTF-8 capable, but
the user wishes to do transliteration.
But in that case, I understand now that he will have to ask for it explicitly.
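
For illustration, here is one way a program could ask for transliteration
explicitly. This is only a sketch: it assumes the glibc iconv() "//TRANSLIT"
suffix is the interface used (nothing above says which interface is meant), and
the function name utf8_to_ascii_translit is made up for this example. What a
given non-ASCII character turns into depends on the glibc version and the
active locale, so no output is claimed here.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert UTF-8 text to ASCII, asking iconv to
 * transliterate what it can. Returns 0 on success, -1 on failure
 * (errno is left as set by iconv_open() or iconv()). */
int utf8_to_ascii_translit(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL && outsz > 0);

    /* "//TRANSLIT" is a glibc extension; other iconv implementations
     * may reject it, in which case iconv_open() fails. */
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) {
        out[0] = '\0';
        return -1;
    }

    char *src = (char *)in;          /* iconv() wants char **, input is not modified */
    size_t srcleft = strlen(in);
    char *dst = out;
    size_t dstleft = outsz - 1;      /* reserve room for the terminator */

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dstleft);   /* flush any pending state */

    int saved = errno;               /* iconv_close() may overwrite errno */
    iconv_close(cd);
    *dst = '\0';
    errno = saved;
    return rc == (size_t)-1 ? -1 : 0;
}
```

The caller still has to check the return value: a too-small buffer (E2BIG) or
invalid input (EILSEQ) is reported through errno, as with any iconv() call.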
> You are probably referring to context sensitive transliteration.
Forward context only.
> No, we
> definitely don't want to do that in the C library. It would cause
> endless implementation problems (efficiency, buffering, etc.), and it is
> not worth the trouble.
Proper transliteration of Japanese kana can be done knowing only the character
that follows the one currently being displayed, I think.
But a real Japanese text will contain Chinese characters in addition to kana.
Those would have to be converted to a phonetic equivalent before transliteration,
which would need very large, context-dependent tables (and would still not work in
every case). It's simply not feasible.
And that does not even take into account the fact that there is probably no way
to know whether the word displayed is Japanese or Chinese.
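
To make the "forward context only" idea concrete, here is a rough sketch of kana
transliteration with one character of lookahead. It is not anything glibc does.
The table covers only a handful of hiragana, the romanization is roughly
Hepburn, and all names (kana_to_romaji, lookup, and so on) are invented for this
example. The lookahead is needed in two places: a small tsu doubles the
consonant of the following kana, and a small ya/yu/yo merges with the preceding
kana. Anything not in the table, including every Chinese character, comes out
as "?".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Tiny illustrative table: a few hiragana only, not a complete mapping. */
struct kana { wchar_t ch; const char *romaji; };

static const struct kana table[] = {
    {0x3042, "a"},   {0x3044, "i"},  {0x3046, "u"},   {0x3048, "e"},
    {0x304A, "o"},   {0x304B, "ka"}, {0x304D, "ki"},  {0x304F, "ku"},
    {0x3051, "ke"},  {0x3053, "ko"}, {0x3057, "shi"}, {0x305F, "ta"},
    {0x3061, "chi"}, {0x3066, "te"}, {0x3068, "to"},  {0x306B, "ni"},
    {0x3093, "n"},
};

static const char *lookup(wchar_t ch)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].ch == ch)
            return table[i].romaji;
    return NULL;
}

/* Vowel carried by a small ya/yu/yo, or 0 if ch is not one of them. */
static char small_y_vowel(wchar_t ch)
{
    switch (ch) {
    case 0x3083: return 'a';
    case 0x3085: return 'u';
    case 0x3087: return 'o';
    default:     return 0;
    }
}

/* Append s to out, truncating silently when the buffer is full. */
static void append(char *out, size_t outsz, const char *s)
{
    size_t used = strlen(out);
    if (used + 1 < outsz)
        strncat(out, s, outsz - used - 1);
}

void kana_to_romaji(const wchar_t *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    for (size_t i = 0; in[i] != L'\0'; i++) {
        wchar_t next = in[i + 1];    /* the single character of lookahead */

        if (in[i] == 0x3063) {       /* small tsu: double the next consonant */
            const char *r = lookup(next);
            if (r != NULL && strchr("aiueo", r[0]) == NULL) {
                char c[2] = { r[0] == 'c' ? 't' : r[0], '\0' };  /* tchi, not cchi */
                append(out, outsz, c);
            } else {
                append(out, outsz, "?");
            }
            continue;
        }

        const char *r = lookup(in[i]);
        if (r == NULL) {             /* kanji, katakana, anything not in the table */
            append(out, outsz, "?");
            continue;
        }

        char buf[8];
        strcpy(buf, r);
        size_t len = strlen(buf);
        char v = small_y_vowel(next);
        if (v != 0 && len >= 2 && buf[len - 1] == 'i') {
            if (len == 3) {          /* shi, chi: replace the i (sho, cho) */
                buf[len - 1] = v;
            } else {                 /* ki, ni: i becomes y + vowel (kyo, nyo) */
                buf[len - 1] = 'y';
                buf[len] = v;
                buf[len + 1] = '\0';
            }
            i++;                     /* the small kana has been consumed */
        }
        append(out, outsz, buf);
    }
}
```

With this table, the kana sequence ki, small yo, u, to comes out as "kyouto",
and ka, small tsu, ta as "katta". A Chinese character anywhere in the input
just yields "?", which is exactly the limitation described above.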
> If you are not
> happy with what primitive context-free transliteration in mbsrtowcs()
> gives you, then simply use a Unicode locale in order to avoid
> transliteration entirely.
OK; glibc cannot be turned into a full-fledged transliteration library.
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/