Re: illegal UTF-8 sequences
> > Is there a recommendation anywhere on how to deal with illegal UTF-8?
>
> Yes. Read
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
>
> which is a copy of the UTF-8 ISO standard, especially section R.7.
Markus, were you able to check the official definition of a malformed
sequence? I'd be interested to know how many FFFDs (U+FFFD replacement
characters) I should produce between the hyphens in each of these cases:
"-\200\200-"
"-\200\200\200\200\200\200\200-"
"-\340\200-"
"-\340\200\340\200-"
"-\200\340\200-"
"-\300\200\200-"
Using A for a start byte, B for a continuation byte, C for a sequence of
continuation bytes too short for the start byte it follows, and D for a
totally illegal byte, five possible rules for producing FFFDs (one FFFD
per match) would be:
- (A|B|D)+
- (A|B|D)
- AC|B|D
- AC|B{1,6}|D
- AC|B+|D
There may be other possibilities, but all of the above seem reasonable
from at least one point of view.
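
To make the comparison concrete, here is a rough Python sketch that counts
the FFFDs each rule would give for the six strings above. It is only one
reading of the notation, and the following are my assumptions rather than
anything taken from the standard: start bytes are 0xC0-0xFD (the original
six-byte form of UTF-8), D is 0xFE or 0xFF, C may be empty, rules 1 and 2
apply only to bytes that are not part of a structurally complete sequence,
and overlong forms such as \300\200 are NOT rejected. The names segment()
and count_fffd() are made up for the illustration.

```python
def expected_continuations(byte):
    """Number of continuation bytes a start byte announces, or None."""
    if 0xC0 <= byte <= 0xDF:
        return 1
    if 0xE0 <= byte <= 0xEF:
        return 2
    if 0xF0 <= byte <= 0xF7:
        return 3
    if 0xF8 <= byte <= 0xFB:
        return 4
    if 0xFC <= byte <= 0xFD:
        return 5
    return None


def is_continuation(byte):
    return 0x80 <= byte <= 0xBF


def segment(data):
    """Split bytes into (kind, length) units.

    kind is 'ok' (ASCII or a structurally complete sequence, overlong
    forms included), 'AC' (start byte plus too few continuation bytes),
    'B' (a maximal run of stray continuation bytes) or 'D' (0xFE/0xFF).
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        need = expected_continuations(b)
        if b < 0x80:
            units.append(("ok", 1))
            i += 1
        elif need is not None:
            j = i + 1
            # Consume at most `need` continuation bytes after the start byte.
            while j < len(data) and j - i <= need and is_continuation(data[j]):
                j += 1
            have = j - i - 1
            units.append(("ok" if have == need else "AC", j - i))
            i = j
        elif is_continuation(b):
            j = i
            while j < len(data) and is_continuation(data[j]):
                j += 1
            units.append(("B", j - i))
            i = j
        else:
            units.append(("D", 1))
            i += 1
    return units


def count_fffd(data, rule):
    """Count FFFDs for rules 1..5, in the order they are listed above."""
    total = 0
    in_bad_run = False
    for kind, length in segment(data):
        if kind == "ok":
            in_bad_run = False
            continue
        if rule == 1:            # (A|B|D)+ : one per maximal malformed run
            if not in_bad_run:
                total += 1
        elif rule == 2:          # (A|B|D)  : one per malformed byte
            total += length
        elif kind == "B":
            if rule == 3:        # AC|B|D      : one per stray byte
                total += length
            elif rule == 4:      # AC|B{1,6}|D : groups of up to six
                total += -(-length // 6)
            else:                # AC|B+|D     : whole run at once
                total += 1
        else:                    # 'AC' or 'D' is one unit in rules 3-5
            total += 1
        in_bad_run = True
    return total


CASES = [
    b"-\200\200-",
    b"-\200\200\200\200\200\200\200-",
    b"-\340\200-",
    b"-\340\200\340\200-",
    b"-\200\340\200-",
    b"-\300\200\200-",
]

for case in CASES:
    print(case, [count_fffd(case, rule) for rule in range(1, 6)])
```

Note that with these assumptions the last string only has its final \200
counted, because \300\200 is structurally complete. A decoder that rejects
overlong forms would have to decide separately how many FFFDs that pair
deserves, which the five rules as written do not settle.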
But perhaps it's a bit early in the week for this sort of pedantry ...
Edmund
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/