Re: Detection of UTF-8 characters in perl.
On Fri, May 04, 2001 at 11:01:50AM +0100, Cameron wrote:
> I may well be completely off list topic, and/or this has probably been
> covered to death in here, but I thought i'd raise the subject again, unless
> someone can point me to a search engine of the list archives :)
>
> Essentially, i'm working on some iDNS stuff, and i'm looking for a nice easy
> way to detect whether a string contains a utf8 character. I've looked
> around, of course, and found a few things that seem to tell me it's not
> reliably possible. this may or may not be outdated information :) i've used
> the Convert::Scalar module to check whether the string is marked utf8, but
> it doesn't seem to work on the variables i've passed from a cgi script. of
> course, it seems that this is the grey area. quoting the Perl, Unicode and
> i18N FAQ, "Without a signature you would need a moderate amount of text to
> do a reliable detection. An example of an input source that is probably not
> long enough would be a search widget on a web page."
I wrote a small utility a while ago that checks a string for UTF-8 validity.
I ran it over approximately 500k lines in various charsets (GNOME
translations), all containing characters with the 8th bit set. About 0.02%
of the lines passed as UTF-8 even though they were not, and almost all of
those were single words in Korean. So you cannot be positively sure that a
given string really is UTF-8, but you can make a good guess.
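
A minimal sketch of that kind of validity check follows. It is in Python
rather than Perl, and it is not the utility mentioned above; the function
name looks_like_utf8 is made up for illustration. It assumes the input
arrives as raw bytes, and treats a string as "probably UTF-8" only if it
has at least one byte with the 8th bit set and decodes as strict UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess whether `data` is UTF-8 text (hypothetical helper).

    Returns True only if the input contains at least one byte >= 0x80
    and the whole input is well-formed UTF-8. Pure ASCII returns False,
    because there is nothing to detect in that case.
    """
    if not any(b >= 0x80 for b in data):
        return False
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # real UTF-8
    print(looks_like_utf8("caf\u00e9".encode("latin-1")))  # lone 0xE9 byte
    # False positive: both bytes lie in the EUC-KR lead/trail range
    # (0xA1-0xFE), yet the pair is also a well-formed 2-byte UTF-8 sequence.
    print(looks_like_utf8(b"\xc7\xb1"))
```

The last call shows why short strings are the problem case: a two-byte
sequence from a legacy multibyte charset can happen to be well-formed
UTF-8, so the check is a good guess rather than a proof.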
cheers/daniel
--
jobb: Metamatrix (www.metamatrix.se)
mobilnummer: 0739-442044
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/