Re: Transliteration for use in UTF-8 locales
Edmund GRIMLEY EVANS wrote on 2000-10-13 10:52 UTC:
> I have always assumed that the multibyte representation is appropriate
> for use as an internal representation, not just as a presentation for
> display to human readers.
Multibyte representation is appropriate for internal representation
primarily if it does not throw away information in unexpected ways and
if it is locale independent. Use iconv and specifically name the
internal representation that you want to have, and it will be suitable
for your needs. UTF-8/UTF-16/UCS-4/etc. will usually all be good
choices. The locale primarily determines what the user sees at the
interface of the program, and you should not use locale-dependent
functions for internal representation of data that you want to have
certain guarantees independent of the locale. For example, it would
usually be foolish to write the data of a large multi-user database into
the database files using locale-dependent functions. The interpretation
of the database would become locale dependent, which is unlikely to be
what you want. However, when you extract data from the database and
present it to the user (e.g., by sending it to stdout or by saving it
into a file for the user), then the locale-dependent functions become
highly appropriate. The output of gettext for example should usually go
through locale-dependent functions, because it is typically intended for
the user, not for internal mechanisms.
All this is summarized by a very simple design rule:
Use wcrtomb() etc. if you need something converted to UTF-8/etc. because the
user wants to have it in UTF-8/etc. according to her locale.
Use iconv() if you need something converted to UTF-8/etc. because the
specification of your application says so.
If you follow this, neither strange locale settings nor transliteration
will affect your system. The user gets what she asked for, and you
remain in full control of your internal text representation.
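
The rule above could be sketched in C roughly as follows. This is only an
illustration under stated assumptions, not a reference implementation: the
function names to_storage_utf8() and to_user_multibyte() are hypothetical,
the buffers are caller-supplied and fixed-size, and the final shift-state
reset needed by stateful encodings is left out for brevity.

```c
#include <iconv.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: the application's specification demands UTF-8,
 * so the target encoding is named explicitly and the locale is never
 * consulted. Returns the number of bytes written, or (size_t)-1. */
size_t to_storage_utf8(const char *from_charset,
                       const char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;   /* iconv() takes char **, input is not modified */
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}

/* Hypothetical helper: the text is meant for the user, so wcrtomb()
 * converts it to whatever multibyte encoding the current locale
 * (LC_CTYPE) selects. Writes a NUL-terminated string and returns its
 * length, or (size_t)-1 on error or if the buffer is too small. */
size_t to_user_multibyte(const wchar_t *ws, char *out, size_t out_cap)
{
    mbstate_t state;
    char buf[MB_LEN_MAX];
    size_t used = 0;

    memset(&state, 0, sizeof state);
    for (; *ws != L'\0'; ws++) {
        size_t n = wcrtomb(buf, *ws, &state);
        if (n == (size_t)-1)
            return (size_t)-1;      /* not representable in this locale */
        if (used + n >= out_cap)
            return (size_t)-1;      /* keep room for the terminating NUL */
        memcpy(out + used, buf, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

A program would typically call setlocale(LC_ALL, "") once at startup so
that to_user_multibyte() follows the user's settings, while
to_storage_utf8() produces the same bytes no matter what the locale is.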
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/