Re: Transliteration for use in UTF-8 locales



Fri, 13 Oct 2000 14:58:28 +0100, Edmund GRIMLEY EVANS <edmundo@xxxxxxxx> writes:

>  - sending data to a child process through command-line arguments and
>    pipes, e.g. mutt talking to gpg
> 
>  - sending strings to a library, e.g. mutt talking to curses
> 
> It seems to me that these automatic-transcription locales might cause
> some problems.

I am currently designing and implementing a charset handling framework
for Haskell. Here is how I am treating transliteration; I don't know
if it will be useful in the context of C.

An important type is Conv, which is similar to iconv_t. It represents
a possibly stateful conversion of a character or byte stream which
is in progress. Conversions are made available as values of another
type, IO Conv: an action which produces a fresh object of type Conv
each time it is run.
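
As an illustration only, here is a minimal sketch of the Conv / IO Conv
split. It assumes a simplified Conv that is nothing more than a step
function; the names Conv, convert, ucs2beDecoder and demo are made up
for this example and are not the framework's actual API. The decoder is
stateful because a two-byte unit may be split across input blocks.

```haskell
import Data.Char (chr)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- Hypothetical stand-in for Conv: one running conversion. Each step maps
-- an input block to an output block plus an error flag.
newtype Conv a b = Conv { convert :: [a] -> IO ([b], Bool) }

-- A value of type IO (Conv a b) is a recipe: running it yields a fresh
-- conversion with its own private state.
-- Toy decoder for big-endian 16-bit units (BMP only, an assumption made
-- to keep the sketch short). Surrogate code units are reported as errors.
ucs2beDecoder :: IO (Conv Word8 Char)
ucs2beDecoder = do
  pending <- newIORef []              -- at most one leftover byte
  return $ Conv $ \block -> do
    left <- readIORef pending
    let (chars, bad, rest) = pairs (left ++ block)
    writeIORef pending rest
    return (chars, bad)
  where
    pairs :: [Word8] -> (String, Bool, [Word8])
    pairs (hi : lo : more) =
      let code = fromIntegral hi * 256 + fromIntegral lo :: Int
          (cs, bad, rest) = pairs more
      in if code >= 0xD800 && code <= 0xDFFF
           then ('\xFFFD' : cs, True, rest)
           else (chr code : cs, bad, rest)
    pairs rest = ([], False, rest)    -- zero or one byte left over

-- Two conversions made from the same recipe do not share state.
demo :: IO (String, String, String)
demo = do
  c1 <- ucs2beDecoder
  (a, _) <- convert c1 [0x00, 0x41, 0x01]  -- 'A', then half a character
  (b, _) <- convert c1 [0x05]              -- completes U+0105
  c2 <- ucs2beDecoder
  (c, _) <- convert c2 [0x05]              -- fresh state: nothing to emit yet
  return (a, b, c)
```

Running demo shows that c1 remembers the dangling byte between blocks,
while c2, obtained by running the same IO action again, starts clean.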

Conversions are not obtained from a static central database, as
they are with iconv_open. Conversions can be constructed at any time,
just like all other objects. Libraries provide conversions or ways
to produce conversions.

A conversion consumes blocks of input and produces blocks of output
together with an error flag. It always produces some output, and
the caller decides what to do with it in case of an error. In practice
errors come from malformed input or from characters unavailable in
the output encoding.
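
A small sketch of the "output plus error flag" contract, using a pure
function for brevity. The names toLatin1, strict and lenient are
hypothetical, and the choice of '?' as the substitute is an assumption
of this example, not something the framework prescribes.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Encode Unicode text as ISO-8859-1. Output is always produced:
-- unencodable characters become '?', and the flag records that at
-- least one such substitution happened.
toLatin1 :: String -> ([Word8], Bool)
toLatin1 s = (map enc s, any (not . fits) s)
  where
    fits c = ord c < 256
    enc c
      | fits c    = fromIntegral (ord c)
      | otherwise = 63                 -- byte value of '?'

-- One caller policy: treat a set flag as a failure.
strict :: String -> Either String [Word8]
strict s = case toLatin1 s of
  (out, False) -> Right out
  (_,   True ) -> Left "unencodable character"

-- Another caller policy: ignore the flag and keep the output.
lenient :: String -> [Word8]
lenient = fst . toLatin1
```

Both policies are built on the same conversion; the conversion itself
does not decide whether an error is fatal.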

Conversions are attached to I/O handles (separately for input and
output), but may also be used directly.

Some conversions already provided (or being implemented):
* Unicode <-> some important concrete charsets, including UTFs.
* Unicode <-> the default local byte encoding. Currently it means using
  nl_langinfo(CODESET) and iconv. These two conversions are used by
  default on new I/O handles.
* Composition of two conversions.
* No conversion.
* Make a conversion from a concrete function transforming a block
  of text.
* Make a conversion from a table of 256 characters, in either direction.
* Use iconv, given two names and a string to insert in case of an error
  (in addition to returning the error flag set).
* Take a conversion and make one that produces the same output but
  ignores errors. Useful for cases where errors would throw exceptions,
  as it is with file I/O, and we don't want them.
* Take a conversion and an Improver, make an improved conversion.
  An Improver is a function which tries to find a substitute for
  an unavailable character. An improver also takes an availability
  tester for other characters in case it wants to know about other
  characters. The produced conversion will use the improver in case
  the original conversion was not able to understand a character
  (maintaining a cache of responses for efficiency).
* An application of the above conversion: improve a conversion using
  an approximation table. I provide my own table, but anybody may pass
  his own instead of mine, or mine modified. A table is a mapping
  from characters to lists of strings: possible approximations in
  the order of preference. Such a string may contain a special marker:
  characters after that marker are not a part of the substitution,
  but their availability is required to use that entry. It is
  useful in some cases, e.g. for semigraphics, and for deciding
  whether to transform a Cyrillic letter to one that looks the same
  (Macedonian dze -> Latin s) or to transliterate it when the rest
  of the Cyrillic alphabet is transliterated too. There is a function
  which transforms an approximation table by selecting only entries
  that don't change the length, but it would need to be more complex
  in the presence of variable-width characters.
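
The last two items can be sketched roughly as follows. This is an
illustration under stated assumptions, not the framework's code: the
type names Avail, Improver and ApproxTable, the use of '\0' as the
marker, the '?' fallback, and the two sample targets are all invented
here, and the response cache mentioned above is left out.

```haskell
import Data.List (find)

type Avail       = Char -> Bool                    -- availability tester
type Improver    = Avail -> Char -> Maybe String   -- find a substitute
type ApproxTable = [(Char, [String])]              -- in order of preference

-- Hypothetical marker: characters after it are required, not emitted.
marker :: Char
marker = '\0'

-- Build an improver from a table: take the first entry whose characters
-- (substitution and required ones alike) are all available.
tableImprover :: ApproxTable -> Improver
tableImprover table avail c = do
  candidates <- lookup c table
  entry <- find (all avail . filter (/= marker)) candidates
  return (takeWhile (/= marker) entry)

-- Improve a (here: trivial) conversion described by its availability test.
improve :: Avail -> Improver -> String -> (String, Bool)
improve avail improver = foldr step ([], False)
  where
    step c (out, bad)
      | avail c   = (c : out, bad)
      | otherwise = case improver avail c of
          Just sub -> (sub ++ out, bad)
          Nothing  -> ('?' : out, True)

-- Keep only entries that replace one character with one character.
sameLengthOnly :: ApproxTable -> ApproxTable
sameLengthOnly = map (\(c, subs) -> (c, filter keeps subs))
  where keeps = (== 1) . length . takeWhile (/= marker)

dze, be, laquo :: Char
dze = '\x0455'; be = '\x0431'; laquo = '\x00AB'

-- dze becomes look-alike "s" only if Cyrillic be is available, else "dz".
sampleTable :: ApproxTable
sampleTable =
  [ (dze,   ["s" ++ [marker, be], "dz"])
  , (laquo, ["<<"]) ]

asciiOnly, asciiPlusCyrillic :: Avail
asciiOnly c         = c < '\x80'
asciiPlusCyrillic c = asciiOnly c || (c >= '\x0410' && c <= '\x044F')
```

With these definitions, improve asciiOnly (tableImprover sampleTable)
turns dze into "dz", because the required Cyrillic letter is missing,
while the same call with asciiPlusCyrillic picks the look-alike "s".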

So transliteration is not used by default. I thought it would be too
dangerous, especially since a conversion is applied by default to
all I/O (this is needed because the internal representation of text in
Haskell is Unicode): an evil user could pass characters that are not
caught as dangerous by the program but are transliterated into
dangerous ones on output. It would also be too inefficient for
something used by default. But a program may easily attach an improved
conversion to its I/O handles, including stdout.
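
To make the danger concrete, here is a toy sketch. The check isSafe,
the three-entry translit mapping and the hostile input are all invented
for this example; a real transliteration table would be far larger.

```haskell
-- A program-level check done on the internal (Unicode) text.
isSafe :: String -> Bool
isSafe = all (`notElem` "<>|")

-- A transliteration applied silently on output.
translit :: Char -> String
translit '\xFF1C' = "<"    -- FULLWIDTH LESS-THAN SIGN
translit '\xFF1E' = ">"    -- FULLWIDTH GREATER-THAN SIGN
translit '\xFF5C' = "|"    -- FULLWIDTH VERTICAL LINE
translit c        = [c]

onOutput :: String -> String
onOutput = concatMap translit

-- Passes the check, yet reaches the outside world as "<b>".
hostile :: String
hostile = ['\xFF1C'] ++ "b" ++ ['\xFF1E']
```

Here isSafe hostile holds, but isSafe (onOutput hostile) does not: the
check ran before the conversion that introduced the dangerous characters.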

One case where a kind of transliteration caused trouble in practice:
Windows 95 does funny things when one tries to use particular
characters in filenames from the Explorer. Some characters, mostly
from the range 128..191, are for some weird reason translated, in
different ways in short and long names. E.g. double angle quotation
marks are replaced with < and >. Since < and > are not legal characters
in a Windows filename, the file is no longer accessible :-)  Other
characters make the file inaccessible only under its short name
(e.g. bullet is replaced with '\7' in the short name, other characters
are replaced with lowercase letters which are illegal there too) or
only under its long name. Various kinds of confusion may be observed.

> I hope at least that wcwidth(wc) gives the appropriate result, i.e.
> wcwidth(L'å') is 2 if 'å' is going to be transcribed as "aa"
> by wctomb.

This approach fails for context-sensitive transliteration...

-- 
 __("<  Marcin Kowalczyk * qrczak@xxxxxxxxxx http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/