Linux UTF-8 locales sort SPACE at level 4
In the file
/usr/share/i18n/locales/iso14651_t1
in many contemporary Linux distributions (e.g., SuSE 9.3), the line
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
specifies that the space character affects the sort order under
LC_COLLATE=en_GB.UTF-8 (and in many other locales) only at level 4,
that is, only if there are no differences in
- base characters
- accents
- uppercase/lowercase
anywhere in the strings being compared.
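To make the consequence concrete, here is a rough sketch that imitates
this behaviour with a plain byte-wise sort, so it needs no particular
locale to be installed. It is only an approximation under stated
assumptions: level4_sort is a hypothetical helper name, only SPACE and
HYPHEN-MINUS are treated as ignored, accents are not handled, upper and
lower case are ordered by byte value (which need not match glibc), and
the input must not contain TAB characters.

```shell
#!/bin/sh
# Sketch only: approximate "SPACE/HYPHEN count only as a last tie-breaker".
# level4_sort is a hypothetical helper, not part of glibc or coreutils.
# Assumes ASCII input without TAB characters.
level4_sort() {
    awk '{
        letters = $0
        gsub(/[ -]/, "", letters)   # drop the characters treated as IGNORE
        # field 1: base letters, case folded
        # field 2: letters with their case kept
        # field 3: the untouched line, consulted only if fields 1 and 2 tie
        printf "%s\t%s\t%s\n", tolower(letters), letters, $0
    }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 -k3,3 |
    cut -f3
}

# Small demonstration with a few sample words.
printf '%s\n' death 'de luge' de-luge deluge 'de Luge' deLuge demark |
    level4_sort
```

With such a rule, "de luge", "de-luge" and "deluge" end up next to each
other between "death" and "demark", because the space and the hyphen are
looked at only after everything else has compared equal.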
Is this really what most users expect? I didn't!
The UCA has lots of options, and I think some discussion is needed
on which of these options are most appropriate for a glibc locale,
possibly leading to a revision or replacement of the iso14651_t1
file.
References:
- Unicode Collation Algorithm (UCA), http://www.unicode.org/reports/tr10/
- ISO TR 14652 (draft: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf)
- http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
- https://bugzilla.novell.com/show_bug.cgi?id=152778
Example:
$ cat >demo.txt
death
de luge
de-luge
deluge
de-luge
de Luge
de-Luge
deLuge
de-Luge
demark
^D
and then try
$ LC_COLLATE=C sort demo.txt
$ LC_COLLATE=en_GB.UTF-8 sort demo.txt
$ LC_COLLATE=en_GB sort demo.txt
and compare the results with how your dictionary or phone book sorts
these.
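One caveat when trying this: a locale that is not installed can leave
sort using the C collation order, which makes the comparison
misleading. The following sketch runs the three commands only for
locales that "locale -a" reports. The helper names normalize and
have_locale are hypothetical, and the sketch assumes that "locale -a"
prints one name per line and that "UTF-8" and "utf8" denote the same
codeset suffix. It also creates a shortened demo.txt if none exists.

```shell
#!/bin/sh
# Sketch only: compare sort orders for the locales that are installed.
# normalize and have_locale are hypothetical helpers for this example.

# LC_ALL takes precedence over LC_COLLATE, so make sure it is not set.
unset LC_ALL

# Lowercase a locale name and drop hyphens: en_GB.UTF-8 -> en_gb.utf8
normalize() {
    tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Succeed if the given locale name appears in the output of "locale -a".
have_locale() {
    wanted=$(printf '%s\n' "$1" | normalize)
    locale -a 2>/dev/null | normalize | grep -qxF "$wanted"
}

# Create a shortened test file unless one is already present.
[ -f demo.txt ] || printf '%s\n' death 'de luge' de-luge deluge \
    'de Luge' de-Luge deLuge demark > demo.txt

for loc in C en_GB.UTF-8 en_GB; do
    if have_locale "$loc"; then
        echo "== LC_COLLATE=$loc"
        LC_COLLATE=$loc sort demo.txt
    else
        echo "== $loc does not appear in 'locale -a', skipping" >&2
    fi
done
```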
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/