Re: C-Kermit + Unicode



I just tested the UTF-8 -> Latin-1 converter in C-Kermit 7.0.196
Beta.10 using the UTF-8 stress test file

  http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-test.txt

I think Kermit's behaviour when illegal UTF-8 sequences are encountered
could be improved. Illegal UTF-8 sequences are defined in ISO 10646-1
Annex R

  http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

section R.7, which specifies that any malformed sequence shall be
treated like a single character that is outside the supported
Unicode subset (which in the case of Kermit means substitution with "?").
A malformed sequence is either:

  - a first octet that is not immediately followed by the correct
    number of continuing octets, or
  - one or more continuing octets that are not required to
    complete a sequence of first and continuing octets, or
  - an invalid octet

The above test file contains all of these.
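
The three cases above all come down to which class each octet falls into.
As a rough sketch (the function name octet_kind is made up for this
illustration and is not part of Kermit), the classes can be read off the
high bits of an octet, assuming the original UTF-8 definition with
sequences of up to six octets:

```python
def octet_kind(octet):
    """Classify one octet by its high bits.

    Returns (kind, continuing), where `continuing` is the number of
    continuing octets that a first octet announces (0 for other kinds).
    """
    if octet < 0x80:
        return ("ascii", 0)        # 0xxxxxxx
    if octet < 0xC0:
        return ("continuing", 0)   # 10xxxxxx
    if octet < 0xE0:
        return ("first", 1)        # 110xxxxx
    if octet < 0xF0:
        return ("first", 2)        # 1110xxxx
    if octet < 0xF8:
        return ("first", 3)        # 11110xxx
    if octet < 0xFC:
        return ("first", 4)        # 111110xx
    if octet < 0xFE:
        return ("first", 5)        # 1111110x
    return ("invalid", 0)          # 0xFE and 0xFF never occur in UTF-8
```

A "first" octet followed by fewer "continuing" octets than announced, a
"continuing" octet with no first octet in front of it, and an "invalid"
octet correspond to the three kinds of malformed sequence.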

The Kermit UTF-8 -> Latin-1 converter currently does the following
strange things:

  - 4-, 5-, and 6-byte sequences are not handled correctly, e.g.
    U+00010000, U+00200000, and U+04000000 are converted to
    "\0" instead of "?".

  - Lonely continuing octets or sequences of these are just passed
    through and not replaced by a "?".

  - First octets that are not followed by a continuation byte
    are not always translated to "?", and several ASCII characters
    following can be swallowed and misinterpreted as continuation
    characters.

  - UTF-8 sequences with some continuation bytes missing cause
    following ASCII characters to be swallowed.

  - Several ASCII bytes after each 0xfe or 0xff (which are illegal
    in UTF-8) are skipped.
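
For comparison, here is a minimal sketch of a converter that avoids the
behaviours listed above. It is not Kermit's code; the names leading_ones
and utf8_to_latin1 are invented for this example. It assumes sequences of
up to six octets, writes one "?" per malformed sequence and per character
above U+00FF, and only consumes an octet as a continuation if it really
has the form 10xxxxxx, so following ASCII bytes are never swallowed. It
does not try to reject overlong forms.

```python
def leading_ones(octet):
    """Count the 1 bits at the top of an octet (0 to 8)."""
    count = 0
    while count < 8 and octet & (0x80 >> count):
        count += 1
    return count


def utf8_to_latin1(data):
    """Convert UTF-8 bytes to Latin-1 bytes, writing "?" for each
    malformed sequence and for each character above U+00FF."""
    out = bytearray()
    i = 0
    while i < len(data):
        ones = leading_ones(data[i])
        if ones == 0:                    # plain ASCII octet
            out.append(data[i])
            i += 1
        elif ones == 1:                  # stray continuing octets:
            while i < len(data) and leading_ones(data[i]) == 1:
                i += 1                   # the whole run becomes one "?"
            out += b"?"
        elif ones >= 7:                  # 0xFE or 0xFF, always invalid
            out += b"?"
            i += 1
        else:                            # first octet of 2 to 6 octets
            code = data[i] & (0x7F >> ones)
            i += 1
            needed = ones - 1
            # Stop at the first octet that is not 10xxxxxx and leave it
            # in place to be processed normally on the next iteration.
            while needed and i < len(data) and leading_ones(data[i]) == 1:
                code = (code << 6) | (data[i] & 0x3F)
                i += 1
                needed -= 1
            if needed or code > 0xFF:    # truncated, or not in Latin-1
                out += b"?"
            else:
                out.append(code)
    return bytes(out)
```

With this sketch, utf8_to_latin1(b"\xe2\x82AB") yields b"?AB" and
utf8_to_latin1(b"\xfeAB") yields b"?AB": the malformed part collapses
to a single "?" and the ASCII bytes after it survive.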

In addition (though not yet required by the standard), it would also be
desirable for security reasons for a UTF-8 decoder to reject overlong
UTF-8 sequences for which a shorter alternative exists. Some types of
security analysis (e.g., processing of escape symbols) are significantly
simplified if for every UCS character there exists exactly one possible
UTF-8 octet sequence that decodes into it, namely the shortest possible
one. The 2-6 byte versions of U+0000 in utf-8-test.txt should therefore
also be converted to "?".
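
One way to detect overlong forms is to compare the decoded value with the
smallest code point that actually needs a sequence of that length. The
following is only a sketch under that assumption; MIN_CODE_POINT and
is_overlong are names invented here, and the input is assumed to be one
complete, otherwise well-formed sequence of 1 to 6 octets.

```python
# Smallest code point that genuinely needs a sequence of each length
# (1 to 6 octets, as in the original UTF-8 definition).
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def is_overlong(sequence):
    """Return True if a well-formed UTF-8 sequence is longer than needed.

    Assumes `sequence` is one complete sequence: a first octet followed
    by exactly the number of continuing octets it announces.
    """
    length = len(sequence)
    if length == 1:
        return False                      # single octets are never overlong
    code = sequence[0] & (0x7F >> length)  # payload bits of the first octet
    for octet in sequence[1:]:
        code = (code << 6) | (octet & 0x3F)
    return code < MIN_CODE_POINT[length]
```

A decoder could call such a check after assembling each multi-octet
sequence and emit "?" when it returns True, so that for example both
b"\xc0\x80" (an overlong U+0000) and b"\xc0\xaf" (an overlong "/") are
rejected.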

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/