
Re: intelligent charset recognition for irc



Wed, Oct 04, 2000 at 01:39:18PM +0200, Jean-Marc Desperrier ->
> Maybe a more general test would be to have a set of functions that
> send back a value that says how likely the text is to be a given
> encoding, and to choose at the end the most positive result ?
> Doing this will make it much easier to add additional charsets later.

Yes, this would be a good solution, provided that the user can choose
which functions to use. I really don't care about text in the KOI8-R
charset because I can't read it anyway :)
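A minimal sketch of that idea, with the caveat that every name and score value here is made up for illustration: each charset gets a function that returns a likelihood between 0 and 1, the user passes in the list of charsets they care about, and the highest score above a threshold wins.

```python
from typing import Callable, Dict, Iterable, Optional

Scorer = Callable[[bytes], float]


def score_utf8(data: bytes) -> float:
    """Likelihood that data is UTF-8 (scores are arbitrary illustrative values)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return 0.0
    # Fewer characters than bytes means at least one multi-byte sequence
    # decoded cleanly; pure ASCII gives no evidence either way.
    return 1.0 if len(text) < len(data) else 0.1


def score_latin1(data: bytes) -> float:
    """Likelihood that data is ISO-8859-1 (scores are arbitrary illustrative values)."""
    if not any(b >= 0x80 for b in data):
        return 0.1
    # 0x80-0x9F are C1 control codes in ISO-8859-1 and rarely occur in text.
    if any(0x80 <= b <= 0x9F for b in data):
        return 0.0
    return 0.5


# Hypothetical registry; further charsets are added by registering a scorer.
SCORERS: Dict[str, Scorer] = {
    "utf-8": score_utf8,
    "iso-8859-1": score_latin1,
}


def guess_charset(data: bytes, enabled: Iterable[str],
                  threshold: float = 0.2) -> Optional[str]:
    """Return the enabled charset with the best score, or None if none is convincing."""
    best_name, best_score = None, threshold
    for name in enabled:
        score = SCORERS[name](data)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Someone who does not care about a charset simply leaves it out of `enabled`, and a `None` result means the message should be passed through untouched.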

> And there will be a demand for that sooner or later.
> When I connect to an IRC server, it's not difficult to find channels
> that use Japanese encoding (iso-2022-jp). I can imagine there are
> Russian, Chinese, etc. users too.

Yes. However as a Swede, my main interest is of course messages in Swedish.

> Will your code run on the server or on the client ?
> You will be confronted with the problem that your code needs to be
> transparent for encodings you do not recognize.

It could be run on the server, but my thought was to have it run in the
client, so that old clients that don't understand UTF-8 would still
get the iso-8859-x messages.

> How long will be the text on which you have to decide the charset ?
> Will you need to auto-detect for each message transferred ?

It has to autodetect on every message. The Swedish channels I am in have
a mixture of iso646-se, iso-8859-1 and UTF-8 (only me).
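For that particular mixture, a per-message decision could be sketched roughly as follows. This is only an illustration under stated assumptions: `decode_message` is a hypothetical name, only the six Swedish letters of iso646-se are mapped, and the test for iso646-se (one of `[\]{|}` directly next to an ASCII letter) is a crude guess that will misfire on things like pasted C code.

```python
import re

# The six positions where iso646-se carries Swedish letters instead of
# ASCII punctuation: [ \ ] { | }  ->  Ä Ö Å ä ö å
SE_MAP = str.maketrans("[\\]{|}", "ÄÖÅäöå")

# Assumed heuristic: one of those characters directly next to an ASCII letter.
_SE_IN_WORD = re.compile(r"[A-Za-z][\[\\\]{|}]|[\[\\\]{|}][A-Za-z]")


def decode_message(raw: bytes) -> str:
    """Decode one message as iso646-se, UTF-8 or ISO-8859-1 (sketch only)."""
    if all(b < 0x80 for b in raw):
        # 7-bit only: either plain ASCII or iso646-se.
        text = raw.decode("ascii")
        if _SE_IN_WORD.search(text):
            return text.translate(SE_MAP)
        return text
    try:
        # Text with high bytes that is valid UTF-8 is rarely anything else.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value, so this fallback cannot fail.
        return raw.decode("iso-8859-1")


print(decode_message(b"r{ksm|rg}s"))
print(decode_message("räksmörgås".encode("utf-8")))
print(decode_message("räksmörgås".encode("iso-8859-1")))
```

All three calls should print the same word, since the three byte strings are the same text in the three encodings.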

> I can give an additional hint: if the message has several characters
> over 0x80 in a row, or too many characters over 0x80, it's very
> probably not ISO-8859-1.

Providing the message is text. But it could be anything really.
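The hint about high bytes can be written down as a small check. The function names and both thresholds below are assumptions picked for illustration, not measured values, and as noted, non-text data will trip the check just as easily as a foreign charset.

```python
from typing import Tuple


def high_byte_stats(data: bytes) -> Tuple[int, float]:
    """Return (longest run of bytes >= 0x80, fraction of bytes >= 0x80)."""
    longest = current = high = 0
    for b in data:
        if b >= 0x80:
            high += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    ratio = high / len(data) if data else 0.0
    return longest, ratio


def looks_unlike_latin1(data: bytes, max_run: int = 2,
                        max_ratio: float = 0.5) -> bool:
    """True if the message has too many or too long runs of high bytes.

    max_run and max_ratio are illustrative guesses, not tuned values.
    """
    run, ratio = high_byte_stats(data)
    return run > max_run or ratio > max_ratio
```

Swedish text in ISO-8859-1 has isolated high bytes between ASCII letters, so it passes, while a Russian word in KOI8-R consists entirely of high bytes and is flagged.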

> Mozilla/Netscape 6 has some code to auto-detect charset, unfortunately
> it's not completely generic.

I just found that myself. I will have a look at those functions.

	n.

-- 
[ http://www.dtek.chalmers.se/~d95mback/ ] [ PGP: 0x453504F1 ] [ UIN: 4439498 ]
    Opinions expressed above are mine, and not those of my future employees.
		  Skingra er! Det finns ingenting att förstå!
SIGBORE: Signature boring error, core dumped
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/