Re: utf-8 and well-formed but illegal chars
On Sun, Feb 12, 2006 at 05:27:16PM +0000, Markus Kuhn wrote:
> If you worry about UTF-16 at all, then I think you should also worry
> about these two [fffe and ffff]. Otherwise, there is no point in
> worrying about surrogates either.
Actually this got me thinking about whether it's necessary or
appropriate to bother with signalling errors for surrogate codepoints
and noncharacters at all when decoding UTF-8 in mb[r]towc or other
similar interfaces. The error conditions basically are:
- Overly long representations: these are inherently a security problem
when using UTF-8 because they make the round-trip map between UTF-8
and UCS a non-identity map in some cases.
- Surrogates: these have no security implications as long as the
encodings in use are only UTF-8 and UCS character numbers (wchar_t).
They only become a problem if someone converts to UTF-16 by applying
the identity map to all code points below 0x10000 without checking
for illegal surrogates, in which case their presence will make the
round trip between UTF-8 and UTF-16 non-identity.
- FFFE: no implications for a UTF-8 and wchar_t only system. When
converted to UTF-16 or UTF-32, may cause systems which honor a BOM
to misinterpret the text entirely, which may have security
implications (e.g. 2F00 gets interpreted as 002F).
- FFFF: may be interpreted as WEOF by broken systems with 16-bit
wchar_t. Otherwise a non-issue.
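
To make the first condition concrete, here is a minimal sketch in C of a
2-byte decoder with and without the overlong check. The names
decode2_naive and decode2_strict are hypothetical, made up for
illustration, and are not taken from any existing library:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx)
 * with no overlong check. Returns the code point, or -1 if the bytes do
 * not have that shape. */
long decode2_naive(const unsigned char *s)
{
    assert(s != NULL);
    if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return -1;
    return ((long)(s[0] & 0x1F) << 6) | (long)(s[1] & 0x3F);
}

/* Same decoder, but rejecting overlong forms: every value below 0x80
 * already has a 1-byte encoding, so a 2-byte form of it is an error. */
long decode2_strict(const unsigned char *s)
{
    long c = decode2_naive(s);
    if (c < 0x80)
        return -1;   /* covers malformed input (-1) and overlong forms */
    return c;
}
```

With the naive version, the bytes C0 AF decode to 002F ('/'), so a filter
that scanned the raw UTF-8 for '/' would never see it. The strict version
returns -1 for the same input, which keeps the UTF-8 <-> UCS map
one-to-one.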
If UTF-8 is going to be the universal character encoding on *nix
systems (and hopefully Internet protocols, embedded systems, and all
other non-MS systems) for the foreseeable future, it's in the utmost
interest of users for performance to be maximized and code size to be
minimized. Otherwise there is a strong urge to stick with legacy 8-bit
encodings.
Of the above error conditions, only overly long sequences affect a
system that only uses UTF-8 and wchar_t, which is the vast majority of
applications. I strongly wonder whether checking for surrogates and
illegal noncharacter codepoints should be moved to the UTF-16 encoder
(in iconv, or other implementations) and omitted from the UTF-8
decoder. The benefits:
- In the naive C implementation with conditional branches for all the
error condition checks, this eliminates two subtractions and two
conditional branches per 3-byte sequence (basically all Asian
scripts). In very naive implementations, these operations would have
been performed for ALL non-ASCII characters.
- In the optimized C implementation with bit twiddling for error
conditions, this eliminates 4 subtractions, 2 bitwise ors, and 1
bitshift per 3-byte sequence. Cache impact of reduced code should be
significant.
- In my heavily optimized x86 implementation, this eliminates 19 bytes
of code (~10% of the total function, and closer to 20% if you only
count the code that gets executed for BMP characters), comprising 7
instructions with heavy data dependencies between them, per 3-byte
sequence. I would estimate about 20 cycles on a modern CPU, plus
time saved due to lowered cache impact.
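
For reference, the checks in question look roughly like this in the
naive C form. This is only a sketch, not my actual implementation: the
function name decode3 and the check_utf16 flag are hypothetical, and the
flag exists only so both behaviours can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: decode one 3-byte UTF-8 sequence
 * (1110xxxx 10xxxxxx 10xxxxxx). If check_utf16 is nonzero, surrogates
 * and FFFE/FFFF are rejected as well. Returns the code point or -1. */
long decode3(const unsigned char *s, int check_utf16)
{
    unsigned long c;

    assert(s != NULL);
    if ((s[0] & 0xF0) != 0xE0 ||
        (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80)
        return -1;

    c = ((unsigned long)(s[0] & 0x0F) << 12)
      | ((unsigned long)(s[1] & 0x3F) << 6)
      |  (unsigned long)(s[2] & 0x3F);

    if (c < 0x800)
        return -1;                  /* overlong: always rejected */

    if (check_utf16) {
        /* The two subtractions and two branches under discussion.
         * Unsigned wraparound turns each range test into one compare. */
        if (c - 0xD800 < 0x800)
            return -1;              /* D800..DFFF: surrogates */
        if (c - 0xFFFE < 2)
            return -1;              /* FFFE and FFFF */
    }
    return (long)c;
}
```

Called with check_utf16 set to 0, the bytes ED A0 80 come back as D800
and EF BF BE as FFFE; with it set to 1, both return -1. Ordinary
characters such as E4 B8 AD (4E2D) decode the same either way, and the
overlong check stays in both cases.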
Naturally the worth of these gains is very questionable. NOT because
computers are "getting faster" -- the idea that you can write slow
code because Western Europe and America have fast computers should not
be tolerated among people interested in i18n and m17n for a second!!
-- but because the gains are _fairly_ small. On the other hand, the
practical benefits of signalling surrogates and fffe/ffff as errors in
an application which does not deal with UTF-16 are nonexistent.
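
For an application that does emit UTF-16, the same two tests could sit
in the encoder instead. A rough sketch, assuming a hypothetical
encode_utf16 function (this is not iconv's interface or any other
existing API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: encode one code point as UTF-16 code units.
 * Writes 1 or 2 units to out and returns how many were written,
 * or 0 if the code point must not be emitted as UTF-16. */
size_t encode_utf16(unsigned long c, unsigned short *out)
{
    assert(out != NULL);
    if (c - 0xD800 < 0x800)
        return 0;                   /* surrogate code point */
    if (c - 0xFFFE < 2)
        return 0;                   /* FFFE and FFFF */
    if (c < 0x10000) {
        out[0] = (unsigned short)c; /* identity map for the rest of the BMP */
        return 1;
    }
    if (c > 0x10FFFF)
        return 0;                   /* outside the UTF-16 range */
    c -= 0x10000;
    out[0] = (unsigned short)(0xD800 | (c >> 10));
    out[1] = (unsigned short)(0xDC00 | (c & 0x3FF));
    return 2;
}
```

This way the cost is paid only by code that actually produces UTF-16,
and the UTF-8 decoder is left with the overlong check alone.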
Markus, Bruno, and others: I'd like to hear your opinions on this
matter. FYI: isomorphism between malformed UTF-8 and invalid wchar_t
values is totally possible without excluding surrogates. Only the
ideas for isomorphism to malformed UTF-16 suffer.
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/