Re: C-Kermit + Unicode
I just tested the UTF-8 -> Latin-1 converter in C-Kermit 7.0.196
Beta.10 using the UTF-8 stress test file
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-test.txt
I think Kermit's behaviour when illegal UTF-8 sequences are encountered
could be improved. Illegal UTF-8 sequences are defined in ISO 10646-1
Annex R
http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
section R.7, which specifies that any malformed sequence shall be
treated like a single character that is outside the supported
Unicode subset (which in the case of Kermit means substitution with "?").
A malformed sequence is either:
- a first octet that is not immediately followed by the correct
number of continuing octets, or
- one or more continuing octets that are not required to
complete a sequence of first and continuing octets, or
- an invalid octet
The above test file contains all of these.
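To make the three cases concrete, here is a minimal sketch of how a
decoder might classify a single octet under the original 1-6 byte UTF-8
scheme. The function name classify_octet and its return values are
invented for illustration; they are not taken from C-Kermit or from the
standard.

```python
def classify_octet(b):
    """Classify one octet (0..255) under the original 1-6 byte UTF-8 scheme.

    Returns (kind, length), where length is the total sequence length
    announced by a first octet, 1 for ASCII, and 0 otherwise.
    The names are hypothetical, chosen for this sketch only.
    """
    if b <= 0x7F:
        return ("ascii", 1)          # 0xxxxxxx
    if b <= 0xBF:
        return ("continuation", 0)   # 10xxxxxx
    if b <= 0xDF:
        return ("first", 2)          # 110xxxxx
    if b <= 0xEF:
        return ("first", 3)          # 1110xxxx
    if b <= 0xF7:
        return ("first", 4)          # 11110xxx
    if b <= 0xFB:
        return ("first", 5)          # 111110xx
    if b <= 0xFD:
        return ("first", 6)          # 1111110x
    return ("invalid", 0)            # 0xFE and 0xFF never occur in UTF-8
```

A first octet is then malformed if fewer than length - 1 continuing
octets follow it, a continuing octet is malformed if no first octet is
waiting for it, and 0xFE and 0xFF are always invalid.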
At the moment, the Kermit UTF-8 -> Latin-1 converter does the following
strange things:
- 4-, 5-, and 6-byte sequences are not handled correctly, e.g.
U+00010000, U+00200000, and U+04000000 are converted to
"\0" instead of "?".
- Lonely continuing octets or sequences of these are just passed
through and not replaced by a "?".
- First octets that are not followed by a continuation byte
are not always translated to "?", and several of the ASCII
characters that follow can be swallowed and misinterpreted as
continuation characters.
- UTF-8 sequences with some continuation bytes missing cause
following ASCII characters to be swallowed.
- Several ASCII bytes after each 0xfe or 0xff (which are illegal
in UTF-8) are skipped.
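The behaviour asked for above could be sketched as follows. This is only
an illustration in Python, not C-Kermit's code; the function name
utf8_to_latin1 and the sub parameter are assumptions. It emits one "?"
per malformed sequence, stops collecting a sequence as soon as a
non-continuation octet appears (so following ASCII is never swallowed),
and maps well-formed characters above U+00FF to "?". It does not yet
reject overlong sequences.

```python
def utf8_to_latin1(data, sub=b"?"):
    """Convert UTF-8 bytes to Latin-1 bytes, one `sub` per malformed sequence."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # ASCII passes through unchanged
            out.append(b)
            i += 1
            continue
        if b <= 0xBF:                      # run of stray continuing octets
            while i < n and 0x80 <= data[i] <= 0xBF:
                i += 1
            out += sub                     # the whole run is one malformed sequence
            continue
        if b >= 0xFE:                      # 0xFE / 0xFF: invalid octet
            out += sub
            i += 1
            continue
        # First octet: work out how many continuing octets it needs.
        if b <= 0xDF:
            need, cp = 1, b & 0x1F
        elif b <= 0xEF:
            need, cp = 2, b & 0x0F
        elif b <= 0xF7:
            need, cp = 3, b & 0x07
        elif b <= 0xFB:
            need, cp = 4, b & 0x03
        else:
            need, cp = 5, b & 0x01
        i += 1
        got = 0
        # Consume continuing octets only; anything else is left for the next pass.
        while got < need and i < n and 0x80 <= data[i] <= 0xBF:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        if got < need or cp > 0xFF:        # truncated, or outside Latin-1
            out += sub
        else:
            out.append(cp)
    return bytes(out)
```

With this sketch, the octets C3 'a' 'b' 'c' come out as "?abc", and
U+00010000 (F0 90 80 80) comes out as a single "?".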
In addition (though not yet required by the standard), it would also be
desirable, for security reasons, for a UTF-8 decoder to reject overlong
UTF-8 sequences for which a shorter alternative exists. Some types of
security analysis (e.g., processing of escape symbols) are significantly
simplified if for every UCS character there exists exactly one possible
UTF-8 octet sequence that decodes into it, namely the shortest possible
one. The 2-6 byte versions of U+0000 in utf-8-test.txt should therefore
also be converted to "?".
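One way to detect overlong forms, sketched here in Python under the
assumption that the sequence has already been checked to be structurally
well-formed (a first octet followed by the right number of continuing
octets): decode the value and compare it with the smallest value that
genuinely needs that many octets. The names MIN_CODE_POINT and
is_overlong are invented for this illustration.

```python
# Smallest UCS value that needs a sequence of the given length
# (original 1-6 byte UTF-8 scheme).
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}

def is_overlong(seq):
    """Return True if a shorter UTF-8 encoding of the same value exists.

    `seq` must be one structurally well-formed sequence of 1-6 octets;
    this sketch does not re-validate its structure.
    """
    length = len(seq)
    if length == 1:
        return False                       # plain ASCII is always shortest
    # Payload bits of the first octet: 5, 4, 3, 2, 1 for lengths 2..6.
    cp = seq[0] & (0xFF >> (length + 1))
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)        # append 6 payload bits per octet
    return cp < MIN_CODE_POINT[length]
```

For example, is_overlong(b"\xc0\x80") is true because C0 80 is a 2-byte
form of U+0000, whereas is_overlong(b"\xc2\x80") is false because U+0080
really does need two octets. A decoder would emit "?" whenever the check
returns true.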
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/