Re: illegal UTF-8 sequences



On Fri, 29 Oct 1999, Markus Kuhn wrote:

> Edmund GRIMLEY EVANS wrote on 1999-10-29 10:37 UTC:
> > Is there a recommendation anywhere on how to deal with illegal UTF-8?
> 
> Yes. Read
> 
>   http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> 
> which is a copy of the UTF-8 ISO standard, especially section R.7.
> 
> It defines what a "malformed sequence" is and if your software receives
> one "then it shall interpret that malformed sequence in the same way
> that it interprets a character that is outside the adopted subset".

I guess then it's ok for Lynx to just drop it completely, if that's
what it does in general with invalid characters? (e.g. invalid C0 / C1
control characters within HTML)

> This is precisely what xterm does now, and xterm has been very
> carefully tested for correct behaviour on malformed sequences. I
> highly recommend using xterm as a reference on how to handle
> malformed sequences in terminal emulators.

Since you are holding up xterm as a reference, I have to point out
that it is decidedly incompatible with the Linux console with respect
to malformed sequences.  The kernel console code just drops them:
console.c in 2.3.26 still has the comment

                    /* Combine UTF-8 into Unicode */
                    /* Incomplete characters silently ignored */

Does the incompatible handling matter?  Have you basically given up on
the text mode console as a viable environment for UTF-8?  Or should
the kernel console code be patched for this?  (I briefly looked
through Bruno's
<ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-console.diff>
and didn't find a change like this in it.)
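
To make the difference concrete, here is a minimal sketch of the two
behaviours side by side.  It is not the actual console.c or xterm code:
the names decode_utf8, POLICY_DROP and POLICY_REPLACE are made up for
illustration, the use of U+FFFD as the substitute is an assumption about
how one might "interpret a character outside the adopted subset", and
for brevity it only handles sequences of up to four bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu   /* assumed substitute for malformed input */

/* Hypothetical policy switch: drop silently, or emit a substitute. */
enum policy { POLICY_DROP, POLICY_REPLACE };

/* Decode len bytes from in[] into code points in out[] (capacity cap).
 * Returns the number of code points written. */
size_t decode_utf8(const unsigned char *in, size_t len,
                   uint32_t *out, size_t cap, enum policy policy)
{
    size_t i = 0, n = 0;

    while (i < len && n < cap) {
        unsigned char b = in[i];
        uint32_t cp, min;
        size_t need, k;

        if (b < 0x80) {
            cp = b;        need = 0; min = 0;
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F; need = 1; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F; need = 2; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07; need = 3; min = 0x10000;
        } else {
            /* Stray continuation byte or invalid lead byte. */
            i++;
            if (policy == POLICY_REPLACE)
                out[n++] = REPLACEMENT;
            continue;
        }
        i++;

        /* Consume continuation bytes; stop early at anything else so
         * that the offending byte is decoded again on its own. */
        for (k = 0; k < need && i < len && (in[i] & 0xC0) == 0x80; k++, i++)
            cp = (cp << 6) | (in[i] & 0x3F);

        /* Truncated, overlong, surrogate, or out of range. */
        if (k < need || cp < min || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            if (policy == POLICY_REPLACE)
                out[n++] = REPLACEMENT;
            continue;
        }
        out[n++] = cp;
    }
    return n;
}
```

With the input bytes 'a' 0xC3 'b' (a two-byte sequence cut short),
POLICY_DROP yields just "ab", while POLICY_REPLACE yields 'a', U+FFFD,
'b', so the user can at least see that something was lost.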

   Klaus

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/