
Choosing the right character representation for i18n issues



I know very little about i18n issues, and I'm not sure that this is the
right place to ask.  If it isn't, please give me a pointer so that I
can ask the question in a more appropriate place.

I've got a certain application in mind and I'm wondering whether using
UTF-8 as the underlying character representation in that application
will be the right choice.  The application is full-text search.  What
I have in mind is to do the following: each document is converted into
UTF-8 for indexing, but the search engine does not store the full-text
of each document.  Instead, only the index is stored along with a URL
for retrieving the complete document (plus, maybe, a document summary
comprising author and title, say).

The query is also converted into UTF-8, and the matching between query
and documents is based on the UTF-8 encoded versions.  If I understand
correctly, there are two potential problems:

(1) Two different UTF-8 character codes should really be regarded as
    the same with respect to character comparison.

(2) Two different characters have the same UTF-8 code, which leads to
    false matches in the retrieval result.
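One concrete instance of problem (1), independent of Chinese, is
Unicode canonical equivalence: an accented letter can be written as a
single precomposed character or as a base letter plus a combining
mark, and the two forms have different UTF-8 byte sequences.  A
minimal sketch, assuming Python and its standard `unicodedata` module;
the helper name `utf8_key` is hypothetical and only for illustration:

```python
import unicodedata

def utf8_key(text):
    # Hypothetical helper: normalize to NFC first, then encode as UTF-8,
    # so canonically equivalent strings produce identical index keys.
    return unicodedata.normalize("NFC", text).encode("utf-8")

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The raw UTF-8 encodings differ, so a byte-wise match would fail ...
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# ... but the normalized keys are the same.
key_equal = utf8_key(precomposed) == utf8_key(decomposed)
```

Applying the same normalization to both documents and queries before
matching would avoid this particular kind of mismatch.  As for
problem (2), UTF-8 itself gives every Unicode code point exactly one
byte sequence, so any sharing of codes would have to come from the
character set, not from the encoding.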

I imagine that issue (1) might arise when the user is accustomed to,
say, Simplified Chinese characters but the document contains
Traditional Chinese characters.  If I'm not mistaken, the simplified
characters have different codes than the traditional ones.  I'm not
sure, however, whether a user of Simplified Chinese would want the
document with the Traditional Chinese character to appear in the
retrieval result.  If a mapping between different character codes is
needed, it is of course always possible to do it with a table, but is
that a feasible solution?
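A table-based mapping could be sketched roughly as follows.  This is
only an illustration under assumptions: the table `TRAD_TO_SIMP` and
the function `fold_key` are hypothetical names, and the table holds
just four entries.  A real table would be much larger, and the
correspondence is not one-to-one (both U+767C and U+9AEE correspond to
the single simplified character U+53D1), which is why this sketch
folds traditional forms onto simplified ones and not the other way
round.

```python
# Tiny illustrative table: traditional character -> simplified character.
TRAD_TO_SIMP = str.maketrans({
    "\u6f22": "\u6c49",  # 漢 -> 汉
    "\u8a9e": "\u8bed",  # 語 -> 语
    "\u767c": "\u53d1",  # 發 -> 发
    "\u9aee": "\u53d1",  # 髮 -> 发 (two traditional forms, one simplified)
})

def fold_key(text):
    # Map every character found in the table; leave all others untouched.
    return text.translate(TRAD_TO_SIMP)

traditional_doc = "\u6f22\u8a9e"   # 漢語
simplified_query = "\u6c49\u8bed"  # 汉语
same_key = fold_key(traditional_doc) == fold_key(simplified_query)
```

Whether such folding is wanted at all is a separate question; the
table could be applied only to the index keys, leaving the original
document text untouched.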

I don't know whether a problem of type (2) is possible with UTF-8 at
all, but I have heard about `Han unification', which might cause such
a problem.  However, I don't understand any of this.

Still another point is that I have heard that not all problems that
the CJKV users have with UTF-8 are of a technical nature; some of them
are political.  And fixing all the technical problems does not, of
course, fix the political issues.  Where could I learn more about the
political and technical sides of this?

To clarify, I would like to point out that I'm aware of the
necessity of doing lots of other things in a retrieval system to
properly accommodate different languages.  I believe that the design of
the system we have in mind does provide proper facilities.  One basic
concept is that of a data type.  One data type would be
`Western text' where there are words separated by whitespace and
punctuation characters.  There would be a specialization of this data
type for, say, the English language, providing useful features such as
stemming, and maybe even phrase recognition.  Other specializations
are possible for other Western languages.
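The data type idea could be sketched as a small class hierarchy.  The
class names `WesternText` and `EnglishText` are hypothetical, and the
suffix stripping below is a deliberately crude stand-in for a real
stemmer, used only to show where language-specific behaviour would
plug in.

```python
import re

class WesternText:
    """Words are runs of word characters separated by whitespace/punctuation."""

    def tokens(self, text):
        return [self.normalize(word) for word in re.findall(r"\w+", text)]

    def normalize(self, word):
        # Language-neutral default: case folding only.
        return word.lower()

class EnglishText(WesternText):
    """Specialization that adds a toy suffix-stripping 'stemmer'."""

    SUFFIXES = ("ing", "ed", "s")

    def normalize(self, word):
        word = super().normalize(word)
        for suffix in self.SUFFIXES:
            # Strip at most one suffix, and keep at least three letters.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

plain = WesternText().tokens("Full-text search")
stemmed = EnglishText().tokens("Indexing documents, indexed")
```

Other Western languages would get their own subclass overriding
`normalize`, while the word-splitting logic stays shared.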

I imagine that it would be possible to design a `text' data type for
each language that the system should be able to deal with.  For
Chinese text, this data type would recognize the different conventions
used for words and sentences.
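Since Chinese text does not separate words with whitespace, such a
data type cannot rely on the splitting used for Western text.  One
simple technique that avoids word segmentation altogether is to index
overlapping character bigrams.  This is a swapped-in illustration, not
necessarily what the system would use; the function name
`char_bigrams` is hypothetical.

```python
import re

def char_bigrams(text):
    # Split into runs of word characters (punctuation ends a run),
    # then emit every overlapping pair of adjacent characters.
    terms = []
    for run in re.findall(r"\w+", text):
        if len(run) == 1:
            terms.append(run)  # a lone character is indexed as itself
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms

# 全文检索。 ("full-text retrieval" followed by an ideographic full stop)
terms = char_bigrams("\u5168\u6587\u68c0\u7d22\u3002")
```

A query would be split the same way, and a document matches when it
contains the query's bigrams.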

The question I asked above is just whether UTF-8 would be a suitable
basis for constructing such a system, or whether it is deficient in
some way.

Last, but not least, there is the possibility that a UTF-8 mailing
list is biased in favor of UTF-8 :-)  I will trust you to provide an
answer that properly takes into account the arguments against
UTF-8.

PS: Have I used the term UTF-8 correctly?  Hm.  OT1H, this is just an
    encoding and does not specify the set of expressible characters,
    but OTOH, UTF-8 is used only in connection with Unicode, right?
    So...
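The distinction between the character set and its encoding can be
made concrete with a short sketch, assuming Python, where a string is
a sequence of Unicode code points and an encoding turns it into bytes:

```python
# One Unicode character (code point U+6C49), two different encodings.
char = "\u6c49"

as_utf8 = char.encode("utf-8")       # three bytes
as_utf16 = char.encode("utf-16-be")  # two bytes

# Both byte sequences decode back to the same code point.
same_char = as_utf8.decode("utf-8") == as_utf16.decode("utf-16-be")
code_point = ord(char)
```

The set of expressible characters comes from Unicode; UTF-8 only
determines how each code point is laid out as bytes.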

Thanks a lot in advance for any hints you might have.
kai
-- 
Life is hard and then you die.
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/