
Re: Transliteration for use in UTF-8 locales



Markus Kuhn wrote:

> Intensive users of non-ASCII scripts will
> use modern PCs with a UTF-8 terminal emulator to telnet into the library
> system, but for the occasional search hit from an ASCII terminal on a non-Latin
> title, it would be nice to have a transcription written out.

This is similar to the first case I described.

> Transliteration is today already applied in glibc 2.1.94 in the C default
> locale for all non-ASCII wide characters.

OK. I didn't understand that.

> Aborting a wprintf() with errno=EILSEQ is potentially even far more
> dangerous than that (because hardly anyone checks and it is even unclear
> how the software should react if this is checked at all), so I think
> that transliteration is a nice example of smooth degradation and robust
> engineering.

I was thinking of the case you spoke about earlier: the locale is UTF-8 capable, but
the user wishes to do transliteration.
But in that case, I understand now that he will have to ask for it explicitly.

> You are probably referring to context sensitive transliteration.

Forward context only.

> No, we
> definitely don't want to do that in the C library. It would cause
> endless implementation problems (efficiency, buffering, etc.), and it is
> not worth the trouble.

Proper transliteration of Japanese kana can be done knowing only the value of the
character that follows the one currently being displayed, I think.
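
To illustrate the one-character lookahead, here is a minimal sketch in Python (chosen for brevity; it is not glibc code). It assumes Hepburn-style romanization, and the tiny table and the names `BASE`, `SMALL_Y`, `translit_one` and `transliterate` are hypothetical, made up for this example. Each output piece depends only on the current character and the one after it.

```python
# Toy table: a few hiragana only (illustrative, not a complete mapping).
BASE = {
    "あ": "a", "い": "i", "か": "ka", "き": "ki", "こ": "ko",
    "し": "shi", "ち": "chi", "て": "te", "と": "to",
    "に": "ni", "ほ": "ho", "ん": "n",
}
SMALL_Y = {"ゃ": "a", "ゅ": "u", "ょ": "o"}  # small ya / yu / yo
SOKUON = "っ"  # small tsu: doubles the consonant of the next syllable


def translit_one(ch, nxt):
    """Romanize ch, looking only at the following character nxt (or None)."""
    if ch == SOKUON:
        # Emit the first consonant of the next syllable ("t" before "ch").
        r = BASE.get(nxt, "")
        if r.startswith("ch"):
            return "t"
        return r[0] if r and r[0] not in "aiueo" else ""
    if ch in SMALL_Y:
        # The preceding kana already dropped its "i"; just supply the vowel.
        return SMALL_Y[ch]
    if ch not in BASE:
        return "?"  # context-free fallback for anything not in the table
    r = BASE[ch]
    if nxt in SMALL_Y and r.endswith("i") and len(r) > 1:
        stem = r[:-1]  # "ki" -> "k", "shi" -> "sh"
        return stem + "y" if len(stem) == 1 else stem
    if ch == "ん" and BASE.get(nxt, "x")[0] in "aiueoy":
        return "n'"  # keep the syllable boundary visible before a vowel
    return r


def transliterate(text):
    """Join the per-character results; no state is carried between calls."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else None
        out.append(translit_one(ch, nxt))
    return "".join(out)
```

With this table, `transliterate("にっき")` gives `nikki` and `transliterate("としょかん")` gives `toshokan`. Anything outside the table, Chinese characters included, falls through to `?`.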

But after all, real Japanese text also contains Chinese characters in addition to
kana. Those characters would have to be converted to a phonetic equivalent before
transliteration, and that would need very large, context-dependent tables (and it
would still not work in every case). It's simply not feasible.

And in saying that, I am not even taking into account the fact that there is probably
no way to know whether the word being displayed is Japanese or Chinese.

> If you are not
> happy with what primitive context-free transliteration in mbsrtowcs()
> gives you, then simply use a Unicode locale in order to avoid
> transliteration entirely.

OK; glibc cannot be turned into a full-fledged transliteration library.

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/