Re: Transliteration for use in UTF-8 locales



Edmund GRIMLEY EVANS wrote on 2000-10-13 09:08 UTC:
> Is it really a good idea to do transliteration in the locale?

Yes, I definitely think so. If not, there would have to be an extra
transliteration API that in practice most developers would not bother to
use. By making transliteration part of the locale, that is, an integral
part of the wide-character to multibyte conversion process, any
application that has undergone what I called "hard conversion" in my FAQ
to be UTF-8 enabled (i.e., use wchar_t internally and let the library do
the UTF-8 I/O conversion) will automatically also start to behave
gracefully on old ASCII terminals, because UTF-8 and transliteration are
produced by the very same mechanism. That sounds very elegant, robust
and convenient to me. On the other hand, if wcrtomb() or wprintf()
produced an error code for every character not available in the external
output character set, and the application programmer became responsible
for trapping all these errors and transliterating them into "?" or
something nicer, that would be a horribly inconvenient scenario, one
that most programmers could not be bothered to follow.
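
To make the inconvenient scenario concrete, here is a minimal sketch of
the per-character error trapping every application would have to carry
if the library did not transliterate. The helper name
wcs_to_mbs_fallback() is hypothetical and not part of any standard
library; only wcrtomb() and the error return (size_t)-1 come from ISO C.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not a standard function): convert the wide
 * string ws to the current locale's multibyte encoding, writing "?"
 * for every wide character that wcrtomb() rejects.
 * Returns the number of bytes stored, not counting the final NUL.
 * Sketch only: no final shift sequence is emitted for stateful encodings.
 */
size_t wcs_to_mbs_fallback(const wchar_t *ws, char *out, size_t outsize)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    if (outsize == 0)
        return 0;
    memset(&st, 0, sizeof st);            /* initial conversion state */

    for (; *ws != L'\0'; ws++) {
        size_t n = wcrtomb(tmp, *ws, &st);
        if (n == (size_t)-1) {
            /* Not representable (errno is EILSEQ): the conversion
             * state is unspecified now, so reset it and substitute. */
            memset(&st, 0, sizeof st);
            tmp[0] = '?';
            n = 1;
        }
        if (used + n >= outsize)          /* keep room for the NUL */
            break;
        memcpy(out + used, tmp, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

In the default "C" locale, passing a string that contains U+20AC
(EURO SIGN) should yield a "?" in its place. With transliteration in the
locale, a plain wcrtomb() loop or wprintf() call would do the job and
none of this code would be needed.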

> Is non-injective wcrtomb already part of some standard?

Note that wcrtomb() has never been injective in any standard! Most
particularly not in ISO C99 or ISO C90/Amd.1. The standard contains
nothing that would prevent transliteration in wcrtomb().

Let's have a quick look in the holy scripture (ISO/IEC 9899:1999(E)):

       5.2.1.2  Multibyte characters

       [#1]  The  source  character  set  may   contain   multibyte
       characters,  used  to  represent  members  of  the  extended
       character set.  The execution character set may also contain
       multibyte  characters, which need not have the same encoding
       as for the source character set.  For both  character  sets,
       the following shall hold:

         -- The  basic  character  set  shall  be  present and each
            character shall be encoded as a single byte.

  !!     -- The  presence,  meaning,  and  representation  of   any
  !!        additional members is locale-specific.

         -- A  multibyte  character  set may have a state-dependent
            encoding, wherein each sequence of multibyte characters
            begins  in  an  initial  shift  state  and enters other
            locale-specific shift states  when  specific  multibyte
            characters  are  encountered in the sequence.  While in
            the initial shift  state,  all  single-byte  characters
            retain  their usual interpretation and do not alter the
            shift state.  The interpretation for  subsequent  bytes
            in  the  sequence  is  a  function of the current shift
            state.

         -- A byte with all bits zero shall  be  interpreted  as  a
            null character independent of shift state.

         -- A byte with all bits zero shall not occur in the second
            or subsequent bytes of a multibyte character.

The authors of the standard very clearly did not want to exclude
non-injective transliteration as part of the wc->mb conversion process.
If you find any passage that says something to the effect of "for all
wide characters W, either mb->wc(wc->mb(W)) = W or an error has to be
signaled", then please quote chapter and verse. I have searched
thoroughly for a requirement like that in ISO/IEC 9899:1999 and could
not find even the slightest hint of one (fortunately).
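
The round-trip property in question can be probed with a few lines of C.
This is only a sketch: the function name wc_round_trips() is made up for
illustration, and the result depends entirely on the current locale.
Nothing in the standard, as far as I can see, requires the answer to be
1 or -1; a transliterating locale could legitimately make it 0.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical probe (not a standard function).
 * Returns  1 if mb->wc(wc->mb(wc)) == wc in the current locale,
 *          0 if the conversion succeeds but yields a different character
 *            (for example because wc was transliterated),
 *         -1 if either conversion reports an error.
 */
int wc_round_trips(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;
    wchar_t back;
    size_t n, m;

    memset(&st, 0, sizeof st);
    n = wcrtomb(buf, wc, &st);            /* wc -> mb */
    if (n == (size_t)-1)
        return -1;

    memset(&st, 0, sizeof st);
    m = mbrtowc(&back, buf, n, &st);      /* mb -> wc, first character only */
    if (m == (size_t)-1 || m == (size_t)-2)
        return -1;

    return back == wc;
}
```

In the "C" locale, wc_round_trips(L'A') gives 1, and a character outside
the locale's repertoire such as U+20AC gives -1 on the implementations I
would expect this to be run on.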

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/