Re: Substituting malformed UTF-8 sequences in a decoder
Bruno Haible wrote on 2000-07-28 14:58 UTC:
> Markus Kuhn writes:
> > > The appearance of U+FFFD is a kind of error message.
> >
> > Agreed. And the appearance of a U+DCxx (which in UTF-16 is not preceded
> > by a high surrogate) is equally "a kind of error message". Just one that
> > contains a bit (well, seven :-) more information.
>
> The difference is that application writers know how to deal with
> U+FFFD (hollow box, width 1, etc.) But if a byte 0xBB -> U+DCBB
> appears, applications don't know whether it represents an ISO-8859-1
> 0xBB (angle quotation mark) or an ISO-8859-6 0xBB (Arabic semicolon).
All UTF-8 applications that I know at the moment treat U+DCBB and U+FFFD
equally, namely they print a hollow box, width one. I don't understand
what else you would expect. The default behaviour is the desired one
here; no application has to be changed to make this scheme work in
practice.
> It's a problem of the applications. Some application writers think
> that "as many automatic conversions as possible" and "as many
> heuristics as possible" qualifies as smart. Try and teach them.
Well, I do occasionally work at helpdesks and I have to help people in
tears who have lost data in very stupid accidents. In some cases
(a loss-less data conversion corrupted the data, such as automagic CR
insertion after each LF), a trivial Perl script can recover the damage
and make people happy; in other cases, I have to tell them to jump into
the Cam and give the usual lecture on backups (result: more tears).
Being involved in the widespread introduction of another class of
transforms (UTF-8 -> UCS-4/UTF-16/etc.), I feel the responsibility of
educating implementors about how a tiny and really fully compatible
change of the error signalling (0xDCxx instead of 0xFFFD) would leave a
reassuring chance of preventing accidental unrecoverable destruction of
information by editing, saving, or transmitting a file in the wrong
mode, etc. -> fewer users in tears and more heroic results of helpdesk
staff ... :-)
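
To make the difference concrete, here is a minimal sketch in Python,
not part of the original proposal. It assumes that Python's
"surrogateescape" error handler is an acceptable stand-in for the
scheme described above (it maps each undecodable byte 0xNN to U+DCNN),
and that the "replace" handler stands in for plain U+FFFD substitution.
The function names and the sample bytes are made up for illustration.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, turning every malformed sequence into U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def decode_preserving(raw: bytes) -> str:
    """Decode UTF-8, turning every malformed byte 0xNN into U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_preserving(text: str) -> bytes:
    """Encode to UTF-8, turning each U+DCNN back into the raw byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# ISO-8859-1 angle quotation marks around ASCII text, opened by mistake
# as if the file were UTF-8 (hypothetical sample data).
raw = b"\xbbhello\xab"

lossy = decode_lossy(raw)
kept = decode_preserving(raw)

# With U+FFFD the original byte values are gone for good.
assert lossy == "\ufffdhello\ufffd"

# With U+DCxx the low byte of each code point still holds the original byte.
assert kept == "\udcbbhello\udcab"

# Saving the text again reproduces the input bytes exactly, so the file
# can still be re-read later with the correct 8-bit charset.
assert encode_preserving(kept) == raw
assert encode_preserving(kept).decode("iso-8859-1") == "\u00bbhello\u00ab"
```

An application that only displays the text can still render U+DCBB as
the same hollow box it uses for U+FFFD; the only difference is that the
byte is still there when the file is written back.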
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/