Re: Transliteration for use in UTF-8 locales
Edmund GRIMLEY EVANS wrote on 2000-10-13 09:08 UTC:
> Is it really a good idea to do transliteration in the locale?
Yes, I definitely think so. If not, there would have to be an extra
transliteration API that in practice most developers would not bother to
use. By making transliteration part of the locale, that is, an integral
part of the wide-character to multibyte conversion process, any
application that has undergone what I called "hard conversion" in my FAQ
to be UTF-8 enabled (i.e., uses wchar_t internally and lets the library
do the UTF-8 I/O conversion) will automatically also start to behave
gracefully on old ASCII terminals, because UTF-8 and transliteration are
produced by the very same mechanism. That sounds very elegant, robust
and convenient to me. On the other hand, if wcrtomb() or wprintf()
produced an error code for every character not available in the external
output character set, and the application programmer became responsible
for trapping all these errors and transliterating the characters into
"?" or something nicer, that would be a horribly inconvenient scenario
that most programmers could not be bothered to deal with.
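To make the burden concrete, here is a rough sketch of the fallback loop
every application would have to carry if wcrtomb() reported an error
for each unavailable character. The helper name wcs_to_mbs_fallback()
and the choice of "?" as the substitute are my own assumptions for
illustration; nothing like this is defined by the standard or by any
library.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not a standard function): convert a wide string
 * to the multibyte encoding of the current locale, writing "?" for every
 * wide character that wcrtomb() refuses to convert.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if the output buffer is too small.
 * Simplification: no final shift sequence is emitted, so this is only
 * adequate for stateless encodings.
 */
size_t wcs_to_mbs_fallback(const wchar_t *in, char *out, size_t outsize)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];
    size_t used = 0;

    assert(in != NULL && out != NULL);
    memset(&st, 0, sizeof st);

    for (; *in != L'\0'; in++) {
        size_t n = wcrtomb(buf, *in, &st);
        if (n == (size_t)-1) {
            /* Conversion failed: the state is unspecified, so reset it
               and substitute a single '?' by hand. */
            memset(&st, 0, sizeof st);
            buf[0] = '?';
            n = 1;
        }
        if (used + n >= outsize)
            return (size_t)-1;      /* keep room for the final NUL */
        memcpy(out + used, buf, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

With transliteration done inside the locale, all of the error branch
above disappears from the application and a plain wcrtomb() or wprintf()
call is enough.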
> Is non-injective wcrtomb already part of some standard?
Note that wcrtomb() has never been required to be injective by any
standard! In particular, not by ISO C99 or ISO C90/Amd.1. The standard
contains nothing that would prevent transliteration in wcrtomb().
Let's have a quick look in the holy scripture (ISO/IEC 9899:1999(E)):
5.2.1.2 Multibyte characters
[#1] The source character set may contain multibyte
characters, used to represent members of the extended
character set. The execution character set may also contain
multibyte characters, which need not have the same encoding
as for the source character set. For both character sets,
the following shall hold:
-- The basic character set shall be present and each
character shall be encoded as a single byte.
!! -- The presence, meaning, and representation of any
!! additional members is locale-specific.
-- A multibyte character set may have a state-dependent
encoding, wherein each sequence of multibyte characters
begins in an initial shift state and enters other
locale-specific shift states when specific multibyte
characters are encountered in the sequence. While in
the initial shift state, all single-byte characters
retain their usual interpretation and do not alter the
shift state. The interpretation for subsequent bytes
in the sequence is a function of the current shift
state.
-- A byte with all bits zero shall be interpreted as a
null character independent of shift state.
-- A byte with all bits zero shall not occur in the second
or subsequent bytes of a multibyte character.
The authors of the standard very clearly did not want to exclude
non-injective transliteration as part of the wc->mb conversion process.
If you find any passage that says something to the effect of "for all
wide characters W, either mb->wc(wc->mb(W)) = W or an error has to be
signaled", then please quote chapter and verse. I have searched
thoroughly for a requirement like that in ISO/IEC 9899:1999 and could
not find even the slightest hint of one (fortunately).
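For anyone who wants to see how a given locale behaves with respect to
that round-trip property, a small probe along the following lines might
help. The function name wc_roundtrip() and its return convention are
assumptions made up for this sketch; only wcrtomb() and mbrtowc() are
standard.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical probe (not a standard function): test whether
 * mb->wc(wc->mb(W)) == W holds for one wide character W in the
 * current locale.
 * Returns  1 if the character survives the round trip,
 *          0 if it converts but comes back as something else
 *            (or as an incomplete/invalid sequence),
 *         -1 if wcrtomb() itself signals an error.
 */
int wc_roundtrip(wchar_t wc)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];
    wchar_t back = L'\0';
    size_t n, m;

    memset(&st, 0, sizeof st);
    n = wcrtomb(buf, wc, &st);
    if (n == (size_t)-1)
        return -1;                  /* wc->mb refused the character */

    memset(&st, 0, sizeof st);      /* decode from the initial state */
    m = mbrtowc(&back, buf, n, &st);
    if (m == (size_t)-1 || m == (size_t)-2)
        return 0;                   /* invalid or incomplete sequence */

    return back == wc;
}
```

In a locale that transliterates inside wcrtomb(), one would expect this
to return 0 for transliterated characters, and as far as I can see
nothing in the standard forbids that outcome.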
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/