Re: Choosing right character representation for i18n issues
Kai> I've got a certain application in mind and I'm wondering whether
Kai> using UTF-8 as the underlying character representation in that
Kai> application will be the right choice. The application is full-text
Kai> search. What I have in mind is to do the following: each document is
Kai> converted into UTF-8 for indexing, but the search engine does not
Kai> store the full-text of each document. Instead, only the index is
Kai> stored along with a URL for retrieving the complete document (plus,
Kai> maybe, a document summary comprising author and title, say).
One of the things we do is index and search very large corpora (in the
gigabyte range) in many different languages. We convert all the text to
UTF-16 and then index it (all our tools use UTF-16 internally).
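As a rough illustration of the conversion step (not our in-house code), the sketch below decodes a document from its source encoding and produces the UTF-16 code units an indexer would then work on. The function name utf16_code_units and the choice of Python are assumptions made for the example.

```python
import struct

def utf16_code_units(raw, source_encoding):
    """Decode raw bytes from source_encoding and return UTF-16 code units.

    Hypothetical helper for illustration; characters outside the Basic
    Multilingual Plane come back as two units (a surrogate pair).
    """
    text = raw.decode(source_encoding)
    encoded = text.encode("utf-16-le")  # little-endian, no byte order mark
    count = len(encoded) // 2           # two bytes per 16-bit code unit
    return list(struct.unpack("<%dH" % count, encoded))

# The same character arriving in two different source encodings
# ends up as the same UTF-16 code unit.
from_latin1 = utf16_code_units(b"\xe9", "latin-1")
from_utf8 = utf16_code_units(b"\xc3\xa9", "utf-8")
```

Once everything is in one representation, the indexing and search code no longer has to care what encoding a document originally arrived in.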
Three things are critical to searching properly:
1. Text and queries are normalized to decomposed form.
2. Regular expression capability.
3. Han text is marked with special font selection tags and the search code
knows this.
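Point 1 can be sketched with Python's standard unicodedata module, which is an assumption of this example and not the library we use. Both the text and the query are put into canonical decomposed form (NFD) before comparison. The strip_marks helper is a further assumption, showing one thing decomposed text makes easy: ignoring combining marks altogether.

```python
import unicodedata

def to_decomposed(text):
    """Return text in Unicode canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text)

def contains(document, query):
    """Substring search after normalizing both sides to NFD."""
    return to_decomposed(query) in to_decomposed(document)

def strip_marks(text):
    """Hypothetical helper: decompose, then drop combining marks."""
    return "".join(
        ch for ch in to_decomposed(text) if not unicodedata.combining(ch)
    )

precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

# Without normalization these two spellings of the same word do not match.
raw_match = precomposed in ("un " + decomposed + " noir")
normalized_match = contains("un " + decomposed + " noir", precomposed)
```

Here raw_match is False and normalized_match is True, which is the whole reason for normalizing both the indexed text and the queries the same way.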
If these three things are done, you will find you need a lot less
language-specific information when searching. Indexing will still require
language-specific information. We have our own in-house Unicode support
libraries, so text entry is done directly in UTF-16.
I have a sort-of working regular expression package available as well as a
Tuned Boyer-Moore implementation, both for UTF-16 text:
http://crl.nmsu.edu/download.html
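To give a feel for searching over UTF-16 code units, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It is not the Tuned Boyer-Moore implementation from the download page, and the names units and horspool_find are made up for the example. The shift table is a dictionary because a dense table over all 65536 possible code units would be mostly empty for a short pattern.

```python
import struct

def units(text):
    """Hypothetical helper: return the UTF-16 code units of a string."""
    encoded = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(encoded) // 2), encoded))

def horspool_find(haystack, needle):
    """Return the index of the first match of needle in haystack, or -1.

    Both arguments are sequences of UTF-16 code units.
    """
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # For every unit except the last, record the distance from its
    # last occurrence in the pattern to the end of the pattern.
    shift = {unit: m - 1 - i for i, unit in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip ahead based on the unit aligned with the pattern's end;
        # units that never occur in the pattern allow a full-length jump.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1

text = units("the quick brown fox")
hit = horspool_find(text, units("brown"))
miss = horspool_find(text, units("green"))
```

The returned index counts code units, not characters, so text containing surrogate pairs needs that taken into account when mapping a hit back to a character position.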
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

The first virtue is to restrain the tongue; he approaches nearest to the
gods who knows how to be silent, even though he is in the right.
-- Cato the Younger (95-46 B.C.E)
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/