
Re: Detection of UTF-8 characters in perl.



On Fri, May 04, 2001 at 11:01:50AM +0100, Cameron wrote:
> I may well be completely off list topic, and/or this has probably been
> covered to death in here, but I thought I'd raise the subject again, unless
> someone can point me to a search engine of the list archives :)
> 
> Essentially, I'm working on some iDNS stuff, and I'm looking for a nice easy
> way to detect whether a string contains a UTF-8 character. I've looked
> around, of course, and found a few things that seem to tell me it's not
> reliably possible. This may or may not be outdated information :) I've used
> the Convert::Scalar module to check whether the string is marked utf8, but
> it doesn't seem to work on the variables I've passed from a CGI script. Of
> course, it seems that this is the grey area. Quoting the Perl, Unicode and
> i18N FAQ, "Without a signature you would need a moderate amount of text to
> do a reliable detection. An example of an input source that is probably not
> long enough would be a search widget on a web page."

I wrote a small utility a while ago that checks a string for UTF-8 validity.
I ran it on approximately 500k lines in varying charsets, all containing
characters with the 8th bit set (GNOME translations). About 0.02% of the
lines passed as UTF-8 even though they were not, and almost all of those
were single words in Korean. So you cannot be positively sure that a given
string really is UTF-8, but you can make a good guess.
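
For illustration, a check of this kind can be sketched roughly as follows.
This is not the utility mentioned above; it is a minimal sketch in Python
(chosen only for brevity), and the function names has_high_bit and
looks_like_utf8 are made up for this example. It assumes the input arrives
as raw bytes, as it would from a CGI form.

```python
def has_high_bit(data: bytes) -> bool:
    """Return True if any byte has the 8th bit set (data is not pure ASCII)."""
    return any(b >= 0x80 for b in data)


def looks_like_utf8(data: bytes) -> bool:
    """Guess whether data is UTF-8 text containing non-ASCII characters.

    This is a heuristic: True means "well-formed UTF-8 with at least one
    multi-byte sequence", not "definitely UTF-8".
    """
    if not has_high_bit(data):
        # Pure ASCII is valid UTF-8, but carries no evidence either way.
        return False
    try:
        # Strict decoding rejects malformed, truncated and overlong sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    word = "bl\u00e5b\u00e4r"                        # Swedish "blåbär"
    print(looks_like_utf8(word.encode("utf-8")))     # real UTF-8
    print(looks_like_utf8(word.encode("latin-1")))   # Latin-1 bytes
    print(looks_like_utf8(b"\xc7\xa1"))              # short, ambiguous input
```

The last call shows the kind of false positive described above: the two
bytes C7 A1 are well-formed UTF-8, but as far as I can tell they also fall
inside the hangul range of EUC-KR, so a short Korean word can pass the check
without being UTF-8. The longer the input, the less likely that becomes.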

cheers/daniel

-- 
jobb: Metamatrix (www.metamatrix.se)
mobilnummer: 0739-442044
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/