
Serious UTF-8 mbrtowc() implementation error in glibc 2.2



Ulrich Drepper wrote on 2000-03-14 16:19 UTC:
> Markus Kuhn <Markus.Kuhn@xxxxxxxxxxxx> writes:
> 
> >   - mbtowc receives a partial UTF-8 sequence and returns -2, then
> >     it has to keep the already received partial sequence in its internal
> >     state and expect the completion of the sequence in the following
> >     calls
> 
> I've explained you that this is wrong.  You cannot count on it.

No, Ulrich, if you claim ISO/IEC 9899:1999 conformance, then I
definitely can count on mbrtowc() to support single-byte feeding. The
standard leaves no room for interpretation. In addition, the required
functionality is *extremely* helpful for application writers. This
therefore is not just language lawyer nitpicking, but an issue that
makes a big difference for the practical usefulness of the entire glibc
2.2 multi-byte support.

Please read the relevant sections of the ISO C 99 standard again:

It says in section 7.24.6, paragraph 4 (page 375ff):

  "The conversion state described by the pointed-to object is altered as
  needed to track the shift state, and the position within a multibyte
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
  character, for the associated multibyte character sequence."
  ^^^^^^^^^

UTF-8 does not have shift state, but it still has state related to
partial sequences and so mbstate_t is definitely needed for UTF-8
conversion.

The standard also specifies this return value for mbrtowc():

 (size_t)-2  if the next n bytes contribute to an incomplete (but
             potentially valid) multibyte character, and all n bytes
                                                     ^^^^^^^^^^^^^^^ 
             have been processed (no value is stored).
             ^^^^^^^^^^^^^^^^^^^

(quoted exactly from WG14/N843, section 7.24.6.3.2, page 378)
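
To make concrete what "position within a multibyte character" means for
UTF-8, here is a minimal sketch of a byte-at-a-time decoder. It is not
glibc's code and not the real mbstate_t layout: the names utf8_state,
utf8_feed and utf8_feed_demo are made up for illustration, the sketch
handles only sequences of up to four bytes, and it does not reject
overlong three- and four-byte forms or surrogates. Unlike mbrtowc() it
also returns 1 rather than 0 for a NUL byte. The point is only that the
state has to remember the bits collected so far and how many bytes are
still expected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state; NOT the actual glibc mbstate_t. */
typedef struct {
    uint32_t partial;    /* code point bits collected so far */
    unsigned remaining;  /* continuation bytes still expected; 0 = initial */
} utf8_state;

#define UTF8_INVALID    ((size_t) -1)
#define UTF8_INCOMPLETE ((size_t) -2)

/* Feed exactly one byte. Returns 1 and stores *out once a character is
 * complete, UTF8_INCOMPLETE if more bytes are needed (nothing stored),
 * or UTF8_INVALID on a malformed sequence. */
size_t utf8_feed(uint32_t *out, unsigned char byte, utf8_state *st)
{
    if (st->remaining == 0) {                 /* expecting a lead byte */
        if (byte < 0x80) {
            *out = byte;                      /* plain ASCII */
            return 1;
        }
        if (byte >= 0xC2 && byte <= 0xDF) {
            st->partial = byte & 0x1F;
            st->remaining = 1;
        } else if ((byte & 0xF0) == 0xE0) {
            st->partial = byte & 0x0F;
            st->remaining = 2;
        } else if (byte >= 0xF0 && byte <= 0xF4) {
            st->partial = byte & 0x07;
            st->remaining = 3;
        } else {
            return UTF8_INVALID;              /* stray continuation etc. */
        }
        return UTF8_INCOMPLETE;
    }
    if ((byte & 0xC0) != 0x80) {              /* not a continuation byte */
        st->remaining = 0;
        return UTF8_INVALID;
    }
    st->partial = (st->partial << 6) | (byte & 0x3F);
    if (--st->remaining > 0)
        return UTF8_INCOMPLETE;
    *out = st->partial;                       /* sequence completed */
    return 1;
}

/* Usage: feed E2 89 A0 one byte at a time and return the result. */
uint32_t utf8_feed_demo(void)
{
    utf8_state st = { 0, 0 };
    uint32_t cp = 42;
    size_t r;

    r = utf8_feed(&cp, 0xE2, &st);
    assert(r == UTF8_INCOMPLETE);
    r = utf8_feed(&cp, 0x89, &st);
    assert(r == UTF8_INCOMPLETE && cp == 42); /* nothing stored yet */
    r = utf8_feed(&cp, 0xA0, &st);
    assert(r == 1);
    return cp;
}
```

Two small fields are all the state this takes, which is why I say the
requirement is not a burden on the library.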

Please read the original ISO C 99 spec of these functions carefully, and
then everything becomes very obvious. Clearly, mbstate_t is *required*
(not just allowed as you assumed so far) to store partial characters and
support byte-by-byte feeding. This makes things considerably simpler for
the implementor than what you had understood so far.

Just think about the poor implementor who is parsing UTF-8 in
variable-length strings that are stored as a linked list of blocks,
where UTF-8 characters could span block borders. This happens in Perl
and many, many other applications all the time.
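
With a conforming mbrtowc(), such a block list can be decoded with one
mbstate_t carried across the block borders and no copying or
reassembly. The following is only a sketch under assumptions: struct
block, decode_blocks and decode_blocks_demo are hypothetical names, the
locale names "C.UTF-8" and "en_US.UTF-8" may not exist on a given
system, and wchar_t is assumed to hold ISO 10646 values.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical block type for a string stored as a linked list. */
struct block {
    const char *data;
    size_t len;
    const struct block *next;
};

/* Decode all blocks into out[0..max-1]. Returns the number of wide
 * characters stored, or (size_t) -1 on an invalid sequence, on output
 * overflow, or if the input ends in the middle of a character. */
size_t decode_blocks(const struct block *b, wchar_t *out, size_t max)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);           /* initial conversion state */
    for (; b != NULL; b = b->next) {
        const char *p = b->data;
        size_t left = b->len;

        while (left > 0) {
            wchar_t wc;
            size_t r = mbrtowc(&wc, p, left, &st);

            if (r == (size_t) -1)
                return (size_t) -1;      /* invalid sequence */
            if (r == (size_t) -2)
                break;                   /* rest of block consumed; the
                                            character continues in the
                                            next block, st remembers it */
            if (r == 0)
                r = 1;                   /* NUL is one byte in UTF-8 */
            if (count == max)
                return (size_t) -1;      /* output buffer full */
            out[count++] = wc;
            p += r;
            left -= r;
        }
    }
    if (!mbsinit(&st))
        return (size_t) -1;              /* input ended mid-character */
    return count;
}

/* Usage: U+2260 (E2 89 A0) split across two blocks. Returns 0 if no
 * UTF-8 locale could be selected, 1 after a successful check. */
int decode_blocks_demo(void)
{
    struct block b2 = { "\x89\xA0", 2, NULL };
    struct block b1 = { "a\xE2", 2, &b2 };
    wchar_t out[4];

    if (!setlocale(LC_CTYPE, "C.UTF-8") &&
        !setlocale(LC_CTYPE, "en_US.UTF-8"))
        return 0;
    assert(decode_blocks(&b1, out, 4) == 2);
    assert(out[0] == L'a' && out[1] == 0x2260);
    return 1;
}
```

If mbrtowc() did not keep the partial sequence in its state, the caller
would have to copy the tail of each block in front of the next one
before every call, which is exactly the work mbstate_t is meant to
take off the application.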

ISO C 99 contains no "may" in this very clearly formulated
implementation requirement. Your current mbrtowc() fails any proper
UTF-8 test suite that will have to include test calls such as

---------------------------------------------------------------------------
// UTF-8 single byte feeding test for mbrtowc()
// (needs <assert.h>, <string.h>, <wchar.h> and a UTF-8 LC_CTYPE locale)
wchar_t wc;
mbstate_t s;

wc = 42; /* arbitrary number */
memset(&s, 0, sizeof s);                   /* give s a defined value first */
assert(mbrtowc(NULL, NULL, 0, &s) == 0);   /* get s into initial state */
assert(mbrtowc(&wc, "\xE2", 1, &s) == (size_t) -2); /* 1st byte processed */
assert(mbrtowc(&wc, "\x89", 1, &s) == (size_t) -2); /* 2nd byte processed */
assert(wc == 42); /* no value has been stored into wc yet */
assert(mbrtowc(&wc, "\xA0", 1, &s) == 1);  /* 3rd byte processed */
assert(wc == 0x2260); /* E2 89 A0 = U+2260 (not equal) decoded correctly */
---------------------------------------------------------------------------

If you are in contact with authors of locale test suites, please forward
them this mail and ask them to check whether their test suite covers
single-byte feeding to mbrtowc() and related functions.

If you haven't received a copy of ISO C 99 (ISO/IEC 9899:1999, published
1999-12-01) yet, a fairly recent draft is temporarily available at

  http://www.cl.cam.ac.uk/~mgk25/volatile/n2794.pdf

Ulrich, do you now agree?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/