Re: Choosing right character representation for i18n issues



Kai Großjohann wrote on 1999-10-24 18:42 UTC:
> I imagine that issue (1) might arise when the user is accustomed to,
> say, Simplified Chinese characters but the document contains
> Traditional Chinese characters.  If I'm not mistaken, the simplified
> characters have different codes than the traditional ones.  I'm not
> sure, however, whether a user of Simplified Chinese would want the
> document with the Traditional Chinese character to appear in the
> retrieval result.  If a mapping between different character codes is
> needed, it is of course always possible to do it with a table, but is
> that a feasible solution?

You will always need rather large tables before you can do decent
equivalence testing of Unicode strings. The tables that generate the
higher levels of the UCS/Unicode sorting order should be a good starting
place here. People who enter text into search engines usually expect
that strings nearby in the sorting order (e.g., only differing in case,
accentuation, precomposition, etc.) are also found.
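
As a rough illustration of that kind of equivalence, here is a minimal
Python sketch. It does not use the UCS/Unicode sorting tables mentioned
above; it only approximates their effect with the standard "unicodedata"
module, and the function name search_key is made up for this example. It
also does nothing about the Simplified/Traditional Chinese question,
which would need a separate mapping table.

```python
import unicodedata

def search_key(text: str) -> str:
    """Crude folding key that ignores case, accents and precomposition.

    Assumption: this only approximates what proper collation tables
    would give; it is not the Unicode collation algorithm.
    """
    # NFKD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, i.e. the accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() (e.g. it maps sharp s to "ss").
    return stripped.casefold()

# Precomposed and decomposed spellings end up with the same key.
print(search_key("Caf\u00e9") == search_key("cafe\u0301"))
print(search_key("Gro\u00dfjohann") == search_key("GROSSJOHANN"))
```

An index would store search_key(word) next to each word and apply the
same function to the query before the lookup.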

> I don't know whether a problem of type (2) is potentially possible
> with UTF-8 but I have heard about `Han unification' which might be a
> possible cause of such a problem.  However, I don't understand any of
> this.

By the way, none of this has anything to do with "UTF-8", which is just
one particular byte encoding of the "Universal Character Set (UCS)" and
"Unicode". UTF-8 is commonly used under Unix for reasons of
compatibility with ASCII software. There are many other encodings
(UCS-4, UTF-16, various normalization forms, sorting forms, compression
forms, etc.), some of which might be more suitable for index database
files than UTF-8.
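
To make the distinction between character set and encoding concrete,
here is a small Python sketch that encodes the same characters in three
ways. The helper name encodings is made up for this example, and Python's
"utf-32-be" codec is used here as a stand-in for UCS-4.

```python
def encodings(text: str) -> dict:
    """Return the bytes of `text` under several encodings of the same UCS."""
    result = {}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        encoded = text.encode(codec)
        # Every encoding round-trips to the same sequence of characters.
        assert encoded.decode(codec) == text
        result[codec] = encoded
    return result

# ASCII letter, sharp s (U+00DF), and a Han character (U+6F22).
for ch in ("A", "\u00df", "\u6f22"):
    for codec, data in encodings(ch).items():
        print(f"U+{ord(ch):04X} {codec:10} {data.hex()}")
```

Only UTF-8 leaves the ASCII letter as a single unchanged byte, which is
the compatibility argument; the fixed-width forms trade space for
simpler indexing.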

> Still another point is that I have heard that not all problems that
> the CJKV users have with UTF-8 are of a technical nature, some of them
> are political.

As a Japanese friend of mine put it nicely recently: many of the
Japanese "experts" who engage in anti-Unicode flame wars are usually
hardcore geeks with imaginary requirements far away from every-day
reality. We have such geeks on the Unicode mailing list for almost any
common language, not just Japanese.

> And repairing all the technical problems does of
> course not repair the political issues.  Where could I learn more
> about the political and technical sides of this?

It is not really a serious political issue, more an issue among a few
vocal Japanese geeks (many of whom have neither read JIS X 0221 nor
studied the Unihan database). UCS is now an official Japanese Industrial
Standard (JIS X 0221) and it is widely used for Japanese word processing
(MS-Word, etc.). ISO 2022 is a bit like EBCDIC: Not entirely dead yet,
but definitely smelling funny already.

> Life is hard and then you die.

Life is a crime for which there is only one kind of punishment: death.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/