Re: illegal UTF-8 sequences
On Fri, 29 Oct 1999, Markus Kuhn wrote:
> Edmund GRIMLEY EVANS wrote on 1999-10-29 10:37 UTC:
> > Is there a recommendation anywhere on how to deal with illegal UTF-8?
>
> Yes. Read
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
>
> which is a copy of the UTF-8 ISO standard, especially section R.7.
>
> It defines what a "malformed sequence" is and if your software receives
> one "then it shall interpret that malformed sequence in the same way
> that it interprets a character that is outside the adopted subset".
I guess then it's ok for Lynx to just drop it completely, if that's
what it does in general with invalid characters? (e.g. invalid C0 / C1
control characters within HTML)
> This is precisely what xterm does now and xterm has been very carefully
> tested on its correct behaviour on malformed sequences. I highly
> recommend using xterm as a reference on how to handle malformed
> sequences in terminal emulators.
Since you are holding up xterm as a reference, I have to point out
that it is decidedly incompatible with the Linux console wrt malformed
sequences... The kernel console code just drops them - console.c in
2.3.26 still has the comment
/* Combine UTF-8 into Unicode */
/* Incomplete characters silently ignored */
Does the incompatible handling matter? Have you basically given up on
the text mode console as a viable environment for UTF-8? Or should
the kernel console code be patched for this? (I briefly looked
through Bruno's
<ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-console.diff>
and didn't find a change like this in it.)
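
To make the difference concrete, here is a minimal sketch of the two
policies in Python. It is an illustration only, not code from xterm or
from console.c: it relies on Python's built-in "ignore" and "replace"
error handlers, and it assumes that the non-dropping policy means
substituting U+FFFD for the malformed bytes. The function names and the
sample input are made up for this example.

```python
# Illustration only: neither xterm nor the Linux console is implemented this way.
# 0xC3 opens a two-byte sequence, but "B" is not a continuation byte,
# so the lone 0xC3 is a malformed sequence.
MALFORMED = b"A\xc3B"


def decode_dropping(data: bytes) -> str:
    """Silently discard malformed bytes (the 'just drop them' policy)."""
    return data.decode("utf-8", errors="ignore")


def decode_marking(data: bytes) -> str:
    """Substitute U+FFFD REPLACEMENT CHARACTER for malformed bytes.

    Assumption: U+FFFD stands in for whatever a given program shows for
    a character outside its adopted subset.
    """
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Print code points so the result does not depend on the terminal.
    for decode in (decode_dropping, decode_marking):
        text = decode(MALFORMED)
        print(decode.__name__, [f"U+{ord(ch):04X}" for ch in text])
```

With the first policy the user gets no hint that the input was damaged;
with the second there is a visible marker where the bad byte was.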
Klaus
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/