
Re: intelligent charset recognition for irc



Wed, Oct 04, 2000 at 11:56:14AM +0100, Markus Kuhn ->
> Martin Norbäck wrote on 2000-10-04 10:00 UTC:
> > * Check the message for 8-bit characters
> >   if none ->
> >     * Check the message for {}| inside other text (this should be configurable)
> >       if any -> text is in iso-646-se (or -dk or -de or ...)
> >       else   -> text is in iso-646-us (UTF-8 is just as good of course)
> 
> This is a very crude hack and will fail at a significant rate. I'd
> suggest that only UTF-8 and ISO 8859-1 should be autodetected. Are
> national IRV variants still used that widely here that such a hack with
> guaranteed bad side-effects has to be recommended? I personally doubt
> it. Practically nobody uses hardware that doesn't support ISO 8859-1
> these days and an ISO 646-SE autodetector is far more likely to become a
> part of the problem than a part of the solution. ISO 646 died sometime
> in the late 1980s as far as I can tell.

Well, on the IRC channels I am on, ISO 646 is still used, though it is
typed manually by people without access to a Swedish keyboard. You are
right that this is a crude hack, but unfortunately there is no way to
specify your charset on IRC.

As I wrote, this will have to be configurable, though.
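
To make the 7-bit check concrete, here is a rough Python sketch of how I
read it. The function name, the `suspects` parameter and the rule that a
suspect character must touch an ASCII letter are my own assumptions about
what "inside other text" should mean, not a tested detector:

```python
import re
from typing import Optional


def classify_7bit(data: bytes, suspects: bytes = b"{}|",
                  national: str = "iso-646-se") -> Optional[str]:
    """Guess the charset of a message that has no 8-bit characters.

    Returns None if the message contains 8-bit bytes, since those are
    handled by the UTF-8 / ISO 8859-1 check instead.
    """
    if any(b >= 0x80 for b in data):
        return None
    if not suspects:
        # Heuristic switched off by configuration.
        return "us-ascii"
    cls = re.escape(suspects)
    # Assumption: a suspect character directly next to an ASCII letter
    # (as in "p}" for "på") is probably a national letter, while one
    # standing alone (as in "a | b") is probably punctuation.
    pattern = re.compile(
        rb"[A-Za-z][" + cls + rb"]|[" + cls + rb"][A-Za-z]")
    return national if pattern.search(data) else "us-ascii"
```

Passing an empty `suspects` turns the heuristic off, which is the kind of
configurability I have in mind.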

> > * Check the message for illegal UTF-8 sequences
> >   if none -> text is in UTF-8
> >   else    -> text is in iso-8859-1
> 
> That should be fairly practical and reliable to do. Would it be worth
> assuming that the text is in CP1252 instead of ISO 8859-1 if the UTF-8
> test fails?

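Since CP1252 is a superset of the printable characters of ISO 8859-1
(the two differ only in the range 0x80-0x9F, which ISO 8859-1 reserves
for C1 control characters), nothing is lost with this assumption. Of
course, it only helps if the IRC client is able to display these extra
characters.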
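
The 8-bit check with the CP1252 fallback could look roughly like this in
Python. The function name is made up for illustration. The last fallback
is there because Python's cp1252 codec leaves a few bytes undefined, so
it is a property of that codec and not part of the proposal itself:

```python
from typing import Tuple


def decode_irc_line(data: bytes) -> Tuple[str, str]:
    """Return (text, charset guess) for one raw IRC message."""
    try:
        # No illegal UTF-8 sequences: treat the text as UTF-8.
        # Pure ASCII also ends up here, which is harmless.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        # Not valid UTF-8: assume CP1252 instead of ISO 8859-1.
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # Python's cp1252 codec rejects the bytes 0x81, 0x8D, 0x8F,
        # 0x90 and 0x9D; ISO 8859-1 accepts every byte value.
        return data.decode("latin-1"), "iso-8859-1"
```

For example, the bytes `b"Norb\xe4ck"` are not valid UTF-8, so they come
back decoded as CP1252, while the UTF-8 encoding of the same name is
recognised as UTF-8.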

Does anyone know of good algorithms to autodetect KOI8-R and friends?

Another question is what charset you should use for your own messages on
IRC. I think I'll start using UTF-8 and recommend that people who complain
upgrade to a UTF-8-aware client.

	n.

-- 
[ http://www.dtek.chalmers.se/~d95mback/ ] [ PGP: 0x453504F1 ] [ UIN: 4439498 ]
    Opinions expressed above are mine, and not those of my future employees.
		  Skingra er! Det finns ingenting att förstå!
SIGBORE: Signature boring error, core dumped
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/