
Re: illegal UTF-8 sequences



> > Is there a recommendation anywhere on how to deal with illegal UTF-8?
> 
> Yes. Read
> 
>   http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> 
> which is a copy of the UTF-8 ISO standard, especially section R.7.

Markus, were you able to check the official definition of a malformed
sequence? I'd be interested to know how many U+FFFD replacement
characters (FFFDs) I should produce between the hyphens in each of
these cases:

"-\200\200-"
"-\200\200\200\200\200\200\200-"
"-\340\200-"
"-\340\200\340\200-"
"-\200\340\200-"
"-\300\200\200-"

Using A for a start byte, B for a continuation byte, C for a run of
continuation bytes too short to complete the sequence its start byte
announces, and D for a totally illegal byte, here are five possible
rules, each written as the pattern that yields a single FFFD:

 - (A|B|D)+
 - (A|B|D)
 - AC|B|D
 - AC|B{1,6}|D
 - AC|B+|D

There may be other possibilities, but all of the above seem reasonable
from at least one point of view.
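
To make the comparison concrete, here is a rough Python sketch that
counts the FFFDs each rule would give for the six strings. It is only
one reading of the rules, and several things in it are assumptions
rather than anything taken from the standard: the names classify,
continuation_count, RULES and count_fffd are made up for illustration;
start bytes are taken to be 0xC0-0xFD (the original 1-6 byte scheme)
and D to be 0xFE/0xFF; C is allowed to be empty; under the first two
rules the continuation bytes that trail a truncated start byte count as
B; and a start byte followed by the right number of continuation bytes
is treated as one complete sequence with no overlong check, so \300\200
is not flagged here.

```python
import re

def continuation_count(start: int) -> int:
    """Continuation bytes expected after a start byte in 0xC0-0xFD."""
    if start < 0xE0:
        return 1
    if start < 0xF0:
        return 2
    if start < 0xF8:
        return 3
    if start < 0xFC:
        return 4
    return 5

def classify(data: bytes) -> str:
    """Map bytes to class letters.

    x = ASCII, V = structurally complete sequence (not checked for
    overlong forms), A = start byte of a truncated sequence, c = a
    continuation byte trailing such an A, B = stray continuation byte,
    D = 0xFE or 0xFF.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append("x")
            i += 1
        elif b < 0xC0:
            out.append("B")
            i += 1
        elif b >= 0xFE:
            out.append("D")
            i += 1
        else:
            need = continuation_count(b)
            j = i + 1
            # Take at most `need` continuation bytes after the start byte.
            while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
                j += 1
            got = j - i - 1
            out.append("V" if got == need else "A" + "c" * got)
            i = j
    return "".join(out)

# Each rule from the list above, rewritten over the class letters.
# "Ac*" stands for AC with a possibly empty C (an assumption).
RULES = {
    "(A|B|D)+": r"[AcBD]+",
    "(A|B|D)": r"[AcBD]",
    "AC|B|D": r"Ac*|B|D",
    "AC|B{1,6}|D": r"Ac*|B{1,6}|D",
    "AC|B+|D": r"Ac*|B+|D",
}

def count_fffd(data: bytes, rule: str) -> int:
    """Number of FFFDs the given rule produces: one per regex match."""
    return len(re.findall(RULES[rule], classify(data)))

CASES = [
    b"-\200\200-",
    b"-\200\200\200\200\200\200\200-",
    b"-\340\200-",
    b"-\340\200\340\200-",
    b"-\200\340\200-",
    b"-\300\200\200-",
]

if __name__ == "__main__":
    for case in CASES:
        counts = [count_fffd(case, rule) for rule in RULES]
        print(classify(case), counts)
```

Running it prints, for each string, the class letters followed by the
five counts in the order the rules are listed. A different reading of C
or of overlong sequences would change the numbers, which is really the
point of the question.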

But perhaps it's a bit early in the week for this sort of pedantry ...

Edmund
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/