Re: Transliteration for use in UTF-8 locales



Edmund GRIMLEY EVANS wrote on 2000-10-13 10:52 UTC:
> > If you find any passage that says something to the effect of for all
> > wide characters W: mb->wc(wc->mb(W)) = W or an error has to be signaled,
> > then please quote chapter and verse.
> 
> I haven't found one. However, I have written code that will break if
> the converse is not constrained to be true, i.e. wc->mb(mb->wc(X)) = X.
> Is this guaranteed anywhere?

No, it is not guaranteed, but transliteration will not violate
wc->mb(mb->wc(X)) = X, because mb->wc(X) can produce only wide characters
that are directly available in the external encoding, and these obviously
never need to be transliterated by wc->mb. It is, however, possible that
mb->wc(X) fails with EILSEQ, because it might run into a malformed UTF-8
sequence in a UTF-8 locale, or into a character above 0x7f in an ASCII
locale. Unless the latter is a problem, your code should not break in the
presence of transliteration. In other words:

   wc->mb(mb->wc(X)) = "?" ==> X = "?"
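
A minimal sketch of how that round-trip property could be checked with
the restartable ISO C conversion functions. The helper name roundtrip_ok
is hypothetical, and the sketch assumes a stateless locale encoding (no
shift sequences); it is an illustration, not code from either poster.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: checks wc->mb(mb->wc(s)) == s in the current locale.
   Returns  1 if every character converts back to the same bytes,
            0 if some character comes back as different bytes,
           -1 if a conversion fails (e.g. EILSEQ on a malformed sequence
              or on a byte the locale encoding does not define). */
int roundtrip_ok(const char *s)
{
    mbstate_t in, out;
    size_t left = strlen(s);

    memset(&in, 0, sizeof in);    /* initial conversion state, mb->wc */
    memset(&out, 0, sizeof out);  /* initial conversion state, wc->mb */

    while (left > 0) {
        wchar_t wc;
        char buf[MB_LEN_MAX];

        /* mb->wc: decode one multibyte character */
        size_t n = mbrtowc(&wc, s, left, &in);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;            /* invalid or truncated sequence */
        if (n == 0)
            break;                /* decoded a null character */

        /* wc->mb: encode it again */
        size_t m = wcrtomb(buf, wc, &out);
        if (m == (size_t)-1)
            return -1;            /* not representable in this locale */

        if (m != n || memcmp(buf, s, n) != 0)
            return 0;             /* bytes changed on the round trip */

        s += n;
        left -= n;
    }
    return 1;
}
```

Called after setlocale(LC_CTYPE, ""), roundtrip_ok(X) returning -1
corresponds to the EILSEQ case described above; a return of 0 would be a
violation of wc->mb(mb->wc(X)) = X.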

> Stuff like gettext messages is converted by iconv anyway.

I would much prefer it if it used portable ISO C functions. The use of
iconv here seems to me more like a historical artefact, especially if it
is used to convert Unicode into the locale encoding. Why should I
actually touch the names of the encodings involved, when I can leave all
this far more elegantly and portably to the library, which handles it
internally through the locale-dependent ISO C functions?
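
As a rough sketch of what I mean, the following converts a wide string
into the locale encoding without ever naming an encoding. The helper name
wide_to_locale is hypothetical; whether an unrepresentable character is
transliterated or rejected depends on the C library and is not promised
by ISO C.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts the wide string ws into the encoding of
   the current locale (LC_CTYPE), writing at most outsz bytes including
   the terminating null byte. Returns the number of bytes written, not
   counting the terminator, or -1 if outsz is 0 or a character cannot be
   converted. Output that does not fit is silently truncated. */
long wide_to_locale(const wchar_t *ws, char *out, size_t outsz)
{
    size_t n;

    if (outsz == 0)
        return -1;

    /* wcstombs does not null-terminate when it fills the buffer,
       so reserve one byte and terminate explicitly. */
    n = wcstombs(out, ws, outsz - 1);
    if (n == (size_t)-1)
        return -1;

    out[n] = '\0';
    return (long)n;
}
```

The caller only has to call setlocale(LC_CTYPE, "") once at startup; no
encoding name such as "UTF-8" or "ISO-8859-1" appears in the program.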

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/