Re: UTF-8 support for the ancient shell toys
> From: Bruno Haible <haible@xxxxxxx>
> Date: Mon, 5 Nov 2001 13:17:05 +0100 (CET)
>
> the maintainers (Paul Eggert and Jim Meyering, in particular the
> first of both) hold up the merging processing by
> 1. saying that the patch still requires more investigation from
> their side,
Sure does. Just the other day, for example, I discovered the hard way
that GNU 'sort' has undefined behavior on UTF-8 data when strcoll
fails. This is a real problem for UTF-8, and as I recall that patch
does not address it. The problem is not limited to
UTF-8; it can even occur with unibyte encodings, though in practice I
suspect that it's more common with multibyte encodings like UTF-8.
The problem also affects other text utilities.

I know how to fix the bug but haven't had time to write it up yet.
The way I'd like to fix it is to write a strcoll variant that never
fails, and that is consistent with strcoll. That is, xstrcoll(A,B)
must define a total order and must be consistent with strcoll(A,B)
when the latter succeeds.
Also, I'd like xstrcoll(A,B) to return nonzero when strcoll(A,B) fails
and when A is "different" from B, for some reasonable definition of
"different". Writing xstrcoll has proved a little trickier than it
should be, though, so perhaps I'll have to give up doing this nicely.
(I sure wish the multibyte primitives were not so, well, primitive. :-)
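
For illustration only, here is a minimal sketch of the naive starting point
for such a function. It is not the fix described above. The name
xstrcoll_sketch is hypothetical, and the sketch assumes the POSIX convention
that strcoll reports failure only through errno, so errno has to be cleared
before the call and checked afterwards. When strcoll fails, the sketch falls
back to byte-wise strcmp, so differing strings compare as nonzero.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: compare A and B with strcoll, falling back to
   a byte-wise comparison when strcoll reports an error via errno.  */
static int
xstrcoll_sketch (char const *a, char const *b)
{
  /* strcoll reserves no return value for errors, so the only way to
     detect a failure is to clear errno first and test it afterwards.  */
  errno = 0;
  int diff = strcoll (a, b);

  if (errno != 0)
    {
      /* The result of a failed strcoll is not meaningful.  strcmp
         never fails and returns nonzero whenever the bytes differ.  */
      diff = strcmp (a, b);
    }

  return diff;
}
```

Note what this sketch does not guarantee: mixing strcoll results for some
pairs with strcmp results for others need not be transitive, so it does not
by itself define a total order. That gap is one plausible reason the real
function is trickier to write than it looks.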
> 2. not attributing enough priority and time to this investigation.
That is indeed a problem. We could all use more time.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/