Re: Choosing right character representation for i18n issues
> If I'm not mistaken, the simplified characters have different codes
> than the traditional ones. I'm not sure, however, whether a user of
> Simplified Chinese would want the document with the Traditional
> Chinese character to appear in the retrieval result.
I am sure he would. The language is the same. He will usually even want
a Japanese document containing the word to be shown (provided it is an
infrequent word of at least two characters). E.g. if I look for articles
on "Mao Zedong" or "environmental pollution" or "metal salt precipitation",
I can use the same words for both languages. This applies to about 50% of
the search words people may want to use.
It may even be worthwhile to provide a table for including Hangul-based
Korean texts in such searches. It is easy to convert the Han characters
for "environmental pollution" into Hangul and search Korean documents
as well. The rate of words shared with Chinese may be even higher there.
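
To illustrate the kind of table I mean, here is a rough sketch in Python.
The table name HANJA_READING and the function to_hangul are made up for
this example, the table only covers the one example word, and giving each
Han character a single Korean reading is a simplifying assumption (some
characters have several readings, and word-initial sound changes are
ignored here).

```python
# Hypothetical excerpt of a Han-character -> Hangul reading table.
# Assumption: one reading per character, which is not true in general.
HANJA_READING = {
    "環": "환",
    "境": "경",
    "污": "오",  # form usually used in Chinese text
    "汚": "오",  # variant form usually used in Japanese text
    "染": "염",
}


def to_hangul(word):
    """Return the Hangul spelling of a Han-character word.

    Returns None if any character is missing from the table, so the
    caller can skip the Korean search instead of using a partial key.
    """
    try:
        return "".join(HANJA_READING[ch] for ch in word)
    except KeyError:
        return None


if __name__ == "__main__":
    # "environmental pollution" as written in Han characters
    print(to_hangul("環境污染"))
```

The resulting Hangul string can then be used as an additional search key
for Korean documents.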
> If a mapping between different character codes is
> needed, it is of course always possible to do it with a table, but is
> that a feasible solution?
Unfortunately the relation is not 1:1.
In dozens of cases, several traditional characters have been merged into
one simplified one, often creating polysemy.
Conversion from traditional to simplified works almost perfectly, but not
vice versa.
Therefore, for the purpose of text comparison, I would convert everything
to simplified, and for the purpose of writing, I would use traditional.
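
A minimal sketch of this in Python, under the assumption that a plain
per-character table is good enough for comparison. The names TRAD_TO_SIMP,
fold_to_simplified, same_word and reverse_candidates are invented for this
example, and the table is a tiny excerpt, not a usable conversion table.

```python
# Hypothetical excerpt of a traditional -> simplified table.
# Note the many-to-one entries: two traditional characters share 发.
TRAD_TO_SIMP = {
    "發": "发",  # to emit, to send out
    "髮": "发",  # hair
    "頭": "头",  # head
    "後": "后",  # after, behind
}


def fold_to_simplified(text):
    """Map every character to its simplified form; leave others alone."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def same_word(a, b):
    """Compare two strings after folding both to simplified."""
    return fold_to_simplified(a) == fold_to_simplified(b)


def reverse_candidates(simplified_char):
    """List the traditional characters in the table that fold to this one.

    More than one result means the reverse conversion is ambiguous and
    cannot be done by table lookup alone.
    """
    return sorted(t for t, s in TRAD_TO_SIMP.items() if s == simplified_char)


if __name__ == "__main__":
    print(same_word("頭髮", "头发"))
    print(reverse_candidates("发"))
```

Folding in this direction is a plain lookup. The reverse direction needs
context to choose among the candidates, which is why I would store the
traditional form and fold only for comparison.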
IMHO the mainland Chinese authorities should be lobbied to cooperate in a
proper Han Unification, reforming their reform in such a way that it at
least becomes a 1:1 relationship. The Korean prime minister called for
Han Reunification a few years ago, saying that any future reforms should
be undertaken together by all countries that use the Han writing. IMHO
there is no practical reason not to follow his line, but there is one big
obstacle: the forces of inertia and the reticence of those who in the
1950s started amateurishly tinkering with what until then had been a
unified writing system. In the case of China, the obstacle may fade away
soon, as the generation of reformers dies and unification of the country
remains a top priority.
> Still another point is that I have heard that not all problems that
> the CJKV users have with UTF-8 are of a technical nature, some of them
> are political. And repairing all the technical problems does of
> course not repair the political issues. Where could I learn more
> about the political and technical sides of this?
I don't know any URLs, but I have given you some taste of how the technical
and political sides may become mixed up on a linux-utf-8 list.
-phm
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/