Re: illegal UTF-8 sequences
Edmund GRIMLEY EVANS wrote on 1999-11-08 17:36 UTC:
> Markus, were you able to check the official definition of a malformed
> sequence?
My current reading of it is reflected in the malformed UTF-8 test
suite file on
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
which has been carefully arranged such that every line is exactly 79
characters long if every malformed sequence (as I understand the term)
is replaced by one U+FFFD. The file contains systematic examples of all
possible types of malformed sequences and should give you good test
coverage. If you have problems downloading it via HTTP, it is also in
the examples/ subdir of my font tarball.
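
A minimal sketch of how such a file could be checked against a decoder,
assuming Python 3 and its built-in "replace" error handler. The function
name decoded_line_lengths is hypothetical. Note that a given decoder may
emit a different number of U+FFFD characters per malformed sequence than
the test file assumes, so a line that does not come out at 79 characters
may simply reflect a different definition of "one malformed sequence".

```python
def decoded_line_lengths(data):
    """Return the length in characters of each line of `data` after
    decoding it as UTF-8, with malformed input replaced by U+FFFD."""
    lengths = []
    # Split on raw bytes first, so that a malformed sequence cannot
    # swallow a line terminator during decoding.
    for raw_line in data.splitlines():
        text = raw_line.decode("utf-8", errors="replace")
        lengths.append(len(text))
    return lengths


def lines_not_matching(data, expected=79):
    """Return the 1-based numbers of lines whose decoded length differs
    from `expected`."""
    return [
        number
        for number, length in enumerate(decoded_line_lengths(data), start=1)
        if length != expected
    ]
```

Applied to the contents of the test file (read in binary mode),
lines_not_matching() would list the lines where this particular decoder
disagrees with the expected width.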
The current view of what constitutes an individual malformed sequence is
motivated by simplicity of the decoder. If, however, we want to arrange
for a UTF-8
-> UTF-16 -> UTF-8 roundtrip compatibility of malformed sequences (to
avoid damage to accidentally converted binary files), then we probably
want to use something else. I suggested on unicode@unicode.org last week
to represent malformed UTF-8 by malformed UTF-16, i.e. add 0xDC00 (low
surrogate) to every byte of a malformed sequence. For increased
consistency, this scheme would lead to a different definition of how
long a malformed UTF-8 sequence should be. I'm still spending some
brain-cycles on working out the details (this might perhaps become a
paper at the next Unicode conference if I have time).
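
A rough sketch of the round-trip idea, assuming Python 3. It uses
Python's "surrogateescape" error handler, a later mechanism that maps
each undecodable byte 0x80..0xFF to the lone low surrogate 0xDC00 + byte.
This is not necessarily the scheme whose details are still being worked
out above (in particular, it never escapes bytes below 0x80), and the
function names are hypothetical.

```python
def utf8_to_text(data):
    """Decode UTF-8, turning every undecodable byte b into the lone
    low surrogate U+DC00 + b instead of raising or emitting U+FFFD."""
    return data.decode("utf-8", errors="surrogateescape")


def text_to_utf8(text):
    """Encode to UTF-8, turning each escaped surrogate U+DC80..U+DCFF
    back into the original byte."""
    return text.encode("utf-8", errors="surrogateescape")


def text_to_utf16le(text):
    """Show the intermediate UTF-16 form. "surrogatepass" lets the lone
    surrogates through, so the result is deliberately malformed UTF-16."""
    return text.encode("utf-16-le", errors="surrogatepass")


sample = b"ok \xff\xfe end"          # 0xFF and 0xFE are never valid UTF-8
text = utf8_to_text(sample)
escaped = [ord(ch) for ch in text[3:5]]   # the two escaped bytes
restored = text_to_utf8(text)             # byte-for-byte the original
```

Here escaped holds 0xDCFF and 0xDCFE, and restored equals sample, so
binary data that was converted by accident survives the round trip.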
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/