
Re: illegal UTF-8 sequences



Edmund GRIMLEY EVANS wrote on 1999-10-29 10:37 UTC:
> Is there a recommendation anywhere on how to deal with illegal UTF-8?

Yes. Read

  http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

which is a copy of the UTF-8 ISO standard, especially section R.7.

It defines what a "malformed sequence" is and if your software receives
one "then it shall interpret that malformed sequence in the same way
that it interprets a character that is outside the adopted subset".

This is precisely what xterm does now, and xterm has been very carefully
tested for correct behaviour on malformed sequences. I highly
recommend using xterm as a reference on how to handle malformed
sequences in terminal emulators. A test file with many malformed
sequences is at

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

or in the examples/ directory of

  http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz

I believe that for debugging and security reasons, it is advisable NOT
to silently drop malformed sequences, but to make them visible, so
that the user is made aware that something strange is going on.

> For communication between a program and a library, for example, I've
> been simply ignoring the bad octets, something like:
> 
>   while ((k = mbtowc(&wc, s, n))) {
>     if (k == -1) {
>       ++s, --n;
>       continue;
>     }
>     s += k, n -= k;
>     do_something_with(wc);
>   }
> 
> If this were a standard way to behave, then perhaps xterm should just
> ignore bad octets, too.

I'd recommend instead implementing the following:

#define REPLACEMENT_CHARACTER 0xFFFD

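   /* Stop at the terminating NUL (k == 0) or when the input is used up. */
   while (n > 0 && (k = mbtowc(&wc, s, n))) {
     if (k == -1) {
       /* Malformed: skip one byte, but leave a visible marker. */
       ++s, --n;
       do_something_with(REPLACEMENT_CHARACTER);
       continue;
     }
     s += k, n -= k;
     do_something_with(wc);
   }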

This way, you make sure that the information about something odd having
happened is not lost.

It is actually a shame that when mbtowc discovers a bad sequence, it
cannot signal back how long this bad sequence is. For instance, I
find it nicer to treat a UTF-8 sequence with the last byte missing as a
single malformed sequence, not as a sequence of unexpected bytes. This
is also how I understood the ISO 10646-1 UTF-8 definition text. I'll
have to check whether ISO C Am.1 provides some better interface
to the UTF-8 decoder here than just returning -1 for "something is
fishy". The limited API you used doesn't take into account the
self-synchronization capability of UTF-8.
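
To illustrate what I mean by "a single malformed sequence", here is a
minimal sketch of a hand-written decoder that does not go through
mbtowc at all. The names utf8_decode_one and utf8_decode are
hypothetical helpers made up for this example, not part of any library.
As a simplifying assumption, the sketch accepts only sequences of up to
four bytes (code points up to U+10FFFF, as in the later RFC 3629),
rather than the longer forms of the ISO 10646-1 definition; it also
reports overlong forms and surrogates as U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical helper: decode one unit from s[0..n-1], with n > 0.
 * Stores the code point, or U+FFFD for a malformed sequence, in *out
 * and returns the number of bytes consumed (always at least 1). */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    unsigned char b = s[0];
    size_t need, i;     /* need = number of continuation bytes expected */
    uint32_t cp, min;   /* min = smallest value allowed for this length */

    if (b < 0x80) { *out = b; return 1; }
    else if (b >= 0xC0 && b < 0xE0) { need = 1; cp = b & 0x1F; min = 0x80; }
    else if (b >= 0xE0 && b < 0xF0) { need = 2; cp = b & 0x0F; min = 0x800; }
    else if (b >= 0xF0 && b < 0xF8) { need = 3; cp = b & 0x07; min = 0x10000; }
    else {
        /* Unexpected continuation byte, or a byte in 0xF8..0xFF
         * (assumption: longer forms are not accepted here). */
        *out = REPLACEMENT_CHARACTER;
        return 1;
    }

    for (i = 1; i <= need; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            /* Truncated: the lead byte and the continuation bytes seen
             * so far count as ONE malformed sequence. The offending
             * byte is not consumed, so decoding resynchronizes on it. */
            *out = REPLACEMENT_CHARACTER;
            return i;
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, values beyond U+10FFFF, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = REPLACEMENT_CHARACTER;
    *out = cp;
    return need + 1;
}

/* Hypothetical driver: decode n bytes of s into dst (capacity cap)
 * and return the number of code points written. */
static size_t utf8_decode(const unsigned char *s, size_t n,
                          uint32_t *dst, size_t cap)
{
    size_t count = 0;
    while (n > 0 && count < cap) {
        size_t k = utf8_decode_one(s, n, &dst[count++]);
        s += k;
        n -= k;
    }
    return count;
}
```

With this sketch, the input bytes 'a' 0xE2 0x82 'b' (a three-byte
sequence with its last byte missing) give three results: 'a', one
U+FFFD, and 'b'. A loop that skips one byte per mbtowc failure would
report two U+FFFD for the same input.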

The difference becomes clear in sections 3.3 and 3.4 of the
above-mentioned decoder stress test file.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/