RE: Unicode 3.0.1 fixes UTF-8 spec security problem
> -----Original Message-----
> From: H. Peter Anvin [mailto:hpa@xxxxxxxxx]
> Sent: Thursday, December 28, 2000 9:27 PM
> To: linux-utf8@xxxxxxxxxxxxxxxxxxxx
> Subject: Re: Unicode 3.0.1 fixes UTF-8 spec security problem
>
>
> Followup to: <14892.62974.791969.83105@xxxxxxxxxxxxxxxx>
> By author: Bruno Haible <haible@xxxxxxx>
> In newsgroup: linux.utf8
> >
> > Markus Kuhn writes:
> > > Finally: The Unicode 3.0.1 standard changes the definition of UTF-8
> > > such that overlong sequences must be signalled as an error condition
> > > by a conforming decoder, which is what we had recommended anyway for
> > > a long time for security reasons:
> > >
> > > http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
> >
> > But the sentence
> >
> > "Processes may transform irregular code unit sequences into the
> > equivalent well-formed code unit sequences."
> >
> > based on the definition
> >
> > "An irregular UTF-8 code unit sequence is a six-byte sequence where
> > the first three bytes correspond to a high surrogate, and the next
> > three bytes correspond to a low surrogate."
> >
> > is the opposite of what you wanted to achieve, isn't it?
> >
>
> Yes, it really is. Does anyone know why they adopted this half-measure?
> (It fixes 90% of the problem, but it would be nice if they had avoided
> this additional wart.)
Yes, but there are simply too many "UCS-2 only" implementations deployed.
They too may (soon) be faced with UTF-16 data, but they will not treat
the surrogate range specially, so each surrogate code unit comes out as
its own three-byte sequence. There is no particular security issue for
the non-BMP (really, non-ASCII) characters, so leaving the already
deployed "UCS-2 only" implementations Unicode-conformant is unproblematic
(from a security point of view), while requiring their update (to make
them conformant) would have been problematic (from a Unicode Consortium
point of view). Note that it says "Processes *may* transform...": they
are not required to. XML processors, for example, are on the contrary
*required* (by the XML spec) to *reject* "irregular code unit sequences",
since the latter are not part of any IETF-labelled character encoding,
nor do these "irregular" code unit sequences conform to 10646.
There is even one more wart: a process is allowed to "internally" skip
the check for overlong sequences (which are now illegal) and to interpret
them internally as coded characters, just as before this update. This is
allowed only so that processes that do not have any security issues need
not be updated.
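To make the overlong case concrete, here is a minimal sketch in Python of
a decoder for a single UTF-8 sequence. The names decode_one and
MIN_FOR_LENGTH are my own, chosen for illustration; the sketch handles
only one- to four-byte sequences and leaves out other checks (surrogates,
values above 0x10FFFF). With strict=True it signals overlong sequences as
errors; with strict=False it behaves like the "warty" internal decoder
described above.

```python
# Smallest code point that legitimately needs each sequence length.
# (Illustrative table; names are not taken from any standard or library.)
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(data, strict=True):
    """Decode one UTF-8 sequence at the start of `data`.

    Returns (code_point, length). With strict=True an overlong sequence
    raises ValueError; with strict=False it is accepted and mapped to the
    code point it spells out.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return lead, 1
    elif 0xC0 <= lead < 0xE0:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    # The overlong check: the value must actually need this many bytes.
    if strict and cp < MIN_FOR_LENGTH[length]:
        raise ValueError("overlong sequence")
    return cp, length
```

For instance, decode_one(b"\xc0\xaf") raises ValueError, whereas
decode_one(b"\xc0\xaf", strict=False) returns (0x2F, 2), i.e. a "/" that
a byte-level filter looking for 0x2F would never have seen.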
These warts are there just to give a smooth transition, while still
"illegalising" the problematic cases. As I see it, the intention is not
that "warty" implementations should be included in "new" software, unless
needed for compatibility with the "installed base". That exception
applies to the "irregular" case only, not to the newly "illegal" case,
since overlong sequences were always illegal to generate.
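For the "irregular" case, the following Python sketch shows both options
a process has: reject the six-byte form, or transform it into the
equivalent well-formed four-byte sequence. The function names
is_irregular_pair and regularize are hypothetical; the "surrogatepass"
error handler is a standard feature of Python 3's codecs.

```python
def is_irregular_pair(data):
    """True if `data` is exactly one six-byte sequence whose first three
    bytes encode a high surrogate (U+D800..U+DBFF) and whose last three
    bytes encode a low surrogate (U+DC00..U+DFFF)."""
    if len(data) != 6:
        return False
    high, low = data[:3], data[3:]
    return (
        high[0] == 0xED and 0xA0 <= high[1] <= 0xAF and 0x80 <= high[2] <= 0xBF
        and low[0] == 0xED and 0xB0 <= low[1] <= 0xBF and 0x80 <= low[2] <= 0xBF
    )


def regularize(data):
    """Transform irregular surrogate-pair sequences into well-formed UTF-8.

    Surrogates are let through on decoding, paired up by a round trip
    through UTF-16, and re-encoded strictly. An unpaired surrogate still
    raises UnicodeDecodeError in the UTF-16 step.
    """
    text = data.decode("utf-8", "surrogatepass")
    paired = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return paired.encode("utf-8")


irregular = b"\xed\xa0\x80\xed\xb0\x80"  # U+D800 U+DC00, three bytes each

try:
    irregular.decode("utf-8")  # Python's strict decoder rejects this form
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Here regularize(irregular) yields b"\xf0\x90\x80\x80", the regular
encoding of U+10000. A strict consumer such as an XML processor would
stop at the rejection step instead.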
Kind regards
/kent k
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/