xterm utf8controls
In patch #109 of xterm, we find in the changelog:
- add utf8controls resource to specify whether xterm should interpret 16-bit
characters unpacked from UTF-8 form as control characters if they happen
to fall into that range. This behavior is left unspecified by the Unicode
standard (request by Thomas Wolff).
This is an example of an unnecessary option that will just confuse
users by bloating the space of configuration options.
The idea behind this modification is sound: xterm should reject any
overlong UTF-8 sequence. There is no need to make this configurable. A
correct encoder is not allowed to encode LF as a 6-byte sequence, and
xterm should not interpret any byte sequence as LF apart from 0x0a.
One of the next ISO 10646-1 amendments will probably add the concept of
a "safe UTF-8 decoder", which allows only a single byte sequence (i.e.
the shortest one) to represent a Unicode character. If you try to encode
a Unicode character with a UTF-8 sequence that is longer than necessary,
a safe UTF-8 decoder will be required to treat it like an illegal UTF-8
sequence.
If all UTF-8 decoders are safe decoders, this will considerably
simplify the handling of UTF-8 in security-critical environments.
Example: Imagine you have a shell script that tests whether a submitted
UTF-8 message is not longer than 20 lines. With correct UTF-8 usage, "wc
-l" does not have to be modified to process UTF-8 files, because all it
does is count LF characters (bytes) in the file.
The problem is that unsafe UTF-8 decoders also accept longer byte
sequences that do not contain any 0x0a byte as a representation for the
character LF = U+000a. It would therefore be necessary to extend wc -l
with a full-blown UTF-8 decoder to catch all possible encodings of
U+000a to count them. This would be ugly. It is much more convenient if
UTF-8 decoders on the receiving end do not accept alternative encodings
of LF at all, such that "wc -l" can remain fully ignorant of whether
UTF-8 is being used.
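
To make the "wc -l" problem concrete, here is a minimal Python sketch. The
function lax_decode is hypothetical and deliberately unsafe: it handles only
1- and 2-byte sequences and performs no shortest-form check, which is an
assumption about how a careless decoder might behave, not a description of
any real implementation.

```python
def lax_decode(data: bytes) -> str:
    """Hypothetical unsafe decoder: 1- and 2-byte sequences only,
    with no check that the shortest encoding was used."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and (data[i + 1] & 0xC0) == 0x80):
            # 2-byte sequence, accepted even when it is overlong.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


# Two line breaks are smuggled in as the overlong sequence C0 8A.
message = b"line one\xc0\x8aline two\xc0\x8aline three\n"

# What a byte-oriented tool such as "wc -l" counts: 0x0a bytes only.
lines_seen_by_wc = message.count(b"\n")

# What a receiver with the unsafe decoder ends up displaying.
lines_seen_by_lax_decoder = lax_decode(message).count("\n")

# Python's built-in decoder rejects overlong forms, so it acts as a
# safe decoder in the sense described above.
try:
    message.decode("utf-8")
    strict_decoder_rejects = False
except UnicodeDecodeError:
    strict_decoder_rejects = True
```

Here the byte count sees one line while the unsafe decoder produces three,
which is exactly the mismatch a safe decoder rules out.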
Adding a check to a UTF-8 decoder for whether the unique shortest
encoding has been used is trivial. Just check whether a UTF-8 sequence
starts with any of the following illegal byte combinations:
1100000x
11100000 100xxxxx
11110000 1000xxxx
11111000 10000xxx
11111100 100000xx
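
The prefix check can be sketched in Python as follows. The function name
starts_overlong is hypothetical, and the code covers the original UTF-8
definition with sequences of up to six bytes. It treats both 0xC0 and 0xC1
as overlong lead bytes, since either can only encode code points below
U+0080. This is a sketch of the check alone, not a complete decoder.

```python
def starts_overlong(seq: bytes) -> bool:
    """Return True if seq begins with a prefix that can only belong to
    an overlong (non-shortest) UTF-8 encoding."""
    if not seq:
        return False
    lead = seq[0]
    # 1100000x: a 2-byte sequence for a code point below U+0080.
    if lead in (0xC0, 0xC1):
        return True
    if len(seq) < 2:
        return False
    second = seq[1]
    # For each longer form, the lead byte with all payload bits zero,
    # mapped to the mask selecting the bits of the second byte that
    # must then also be zero (apart from the 10 continuation marker).
    masks = {
        0xE0: 0xE0,  # 11100000 100xxxxx
        0xF0: 0xF0,  # 11110000 1000xxxx
        0xF8: 0xF8,  # 11111000 10000xxx
        0xFC: 0xFC,  # 11111100 100000xx
    }
    mask = masks.get(lead)
    return mask is not None and (second & mask) == 0x80
```

A decoder would call this on each multi-byte sequence before decoding it
and treat a True result like any other illegal sequence.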
With a safe UTF-8 decoder, the utf8controls resource of xterm becomes
redundant and a lot of other potential problems are fixed at the same time.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/