Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn <Markus.Kuhn@xxxxxxxxxxxx>:
> > > Not much good if you're not converting to UTF-16.
> >
> > Well, it works with UCS-4 as well (but I would use a private area for
> > this kind of stuff until it's generally accepted practice to do such
> > hacks with surrogates).
>
> No, this way, you would lose transparency for private area characters.
> If you do in-band signalling of UTF-8 errors in UCS-4, then you must
> use only characters that are forbidden to be encoded in UTF-8 anyway,
> and these are only the surrogates plus U+FFFE and U+FFFF.
So what should mbtowc(&wc, "\xED\xB2\x80", 3) return?
With the libutf8_plug I have here it returns 3 and sets wc to 0xDC80.
I really don't like the idea of a UTF-8 decoder having to know about
surrogates, which have nothing to do with UTF-8. If that sort of thing
starts being imposed, I start to wonder whether Unicode really is too
complex to be secure ...
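To make the 0xDC80 result concrete, here is a rough sketch of the two behaviours in question. It is not the libutf8_plug code; decode3_naive and decode3_strict are hypothetical names invented for this illustration, and they handle only three-byte sequences. The first does the bit arithmetic and nothing else, the second additionally rejects the surrogate range, which is exactly the extra knowledge I would rather a decoder did not need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from libutf8_plug): decode one 3-byte UTF-8
 * sequence of the form 1110xxxx 10xxxxxx 10xxxxxx by bit arithmetic.
 * Returns 3 on success, -1 on a malformed or overlong sequence.
 * It knows nothing about surrogates. */
int decode3_naive(const unsigned char *s, size_t n, unsigned long *out)
{
    unsigned long c;

    assert(s != NULL && out != NULL);
    if (n < 3)
        return -1;
    if ((s[0] & 0xF0) != 0xE0)          /* lead byte must be 1110xxxx */
        return -1;
    if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return -1;                      /* trail bytes must be 10xxxxxx */

    c = ((unsigned long)(s[0] & 0x0F) << 12)
      | ((unsigned long)(s[1] & 0x3F) << 6)
      |  (unsigned long)(s[2] & 0x3F);

    if (c < 0x800)                      /* overlong: fits in fewer bytes */
        return -1;
    *out = c;
    return 3;
}

/* Hypothetical strict variant: same arithmetic, but it also refuses
 * values in the surrogate range U+D800..U+DFFF. */
int decode3_strict(const unsigned char *s, size_t n, unsigned long *out)
{
    unsigned long c;

    if (decode3_naive(s, n, &c) < 0)
        return -1;
    if (c >= 0xD800 && c <= 0xDFFF)     /* surrogate: not allowed in UTF-8 */
        return -1;
    *out = c;
    return 3;
}
```

For the bytes ED B2 80 the arithmetic is (0x0D << 12) | (0x32 << 6) | 0x00 = 0xDC80, so decode3_naive returns 3 with that value, matching what I see from mbtowc here, while decode3_strict returns -1.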
Edmund
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/