Re: iconv standardization and conformance test suite
Tomohiro KUBOTA wrote on 2001-06-09 01:30 UTC:
> I have heard some commercial Unix systems
> have curious iconv() behavior. For example, iconv() of SunOS 5.6
> cannot convert "eucJP" -> "UCS-4", though it can convert
> "eucJP" -> "UTF-8" -> "UCS-4". HP-UX 10.x cannot convert
> "eucJP" -> "ucs4" while it can convert "eucJP" -> "ucs2" -> "ucs4".
> It must be "ucs4", not "UCS-4".
> (I heard these problems from Hironori Sakamoto, a developer of
> w3m-m17n.) Do you think such a confused situation will be fixed?
> Some systems add a BOM while others don't. How about endianness?
It seems to me that there is an urgent need for an iconv compatibility
test suite. Perhaps we should build one and ask X/Open to officially
endorse it for the Unix[TM] trademark.
The implementation could be as simple as a little Perl script that reads
in the Unicode mapping tables, adds a few special tests, and then writes
to stdout a huge list of iconv_open, iconv, and iconv_close calls with
expected output. A little C program reads these from stdin, executes the
given calls with the given parameters, compares the output and reports
any abnormalities.
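
As a rough illustration of the generator side, here is a minimal sketch in Python (rather than Perl). It assumes the simple two-column layout used by many of the Unicode mapping tables ("0xXX <tab> 0xYYYY <tab> # name"); tables with extra columns or multi-code-point entries would need more work. The function names and the tab-separated test-line format (source encoding, target encoding, input bytes in hex, expected output bytes in hex) are invented here for illustration and are not an agreed format.

```python
def parse_mapping_table(text):
    """Yield (source_bytes, code_point) pairs from a two-column mapping table.

    Assumption: each data line is "0xXX<ws>0xYYYY<ws># comment"; lines with
    no Unicode column (undefined bytes) and comment-only lines are skipped.
    """
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or undefined byte
        hex_digits = fields[0][2:]  # drop the leading "0x"
        if len(hex_digits) % 2:
            hex_digits = "0" + hex_digits  # bytes.fromhex needs whole bytes
        yield bytes.fromhex(hex_digits), int(fields[1], 16)


def emit_test_lines(encoding, table_text):
    """Yield one hypothetical test line per table entry, targeting UTF-8.

    The expected output uses Python's built-in UTF-8 encoder for brevity;
    a real generator would use its own reference encoder.
    """
    for src, cp in parse_mapping_table(table_text):
        expected = chr(cp).encode("utf-8")
        yield f"{encoding}\tUTF-8\t{src.hex()}\t{expected.hex()}"
```

A separate checker program would then read each line, call iconv_open with the two encoding names, feed the input bytes to iconv, and compare the result with the expected bytes.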
The test sequence generator should read in the usual ~30 mapping tables,
implement its own reference encoders for
UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16,
UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
and then test all N mappings to UTF-8, and optionally also all N²
mappings from any encoding to any other (to catch cheap-and-dirty,
non-transitive implementations of iconv, as SunOS 5.6 apparently has). The test suite
should also check the behaviour with malformed/overlong sequences in
various encodings (especially UTF-8 decoding), and also at least some
transliterations and non-injective mappings (e.g., CP1252 ->
ISO-8859-1//TRANSLIT and UTF-8 -> JIS X 0208). The test suite would come
with an exact list of required encoding names that every conforming
implementation must support. With such a conformance test in place, we
can then easily send bug reports to all vendors if their implementations
do not pass. Such a conformance test should also greatly boost
implementors' confidence in iconv and eventually lead to the removal of
conversion tables from applications.
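
To make the idea of independent reference encoders and malformed-input tests concrete, here is a small sketch in Python. It covers only three of the encodings listed above (UTF-8, UTF-16BE, UCS-4BE) and assumes the code space is limited to U+0000..U+10FFFF without surrogates, which is narrower than the 31-bit UCS-4 range. All function names are hypothetical. overlong_utf8 deliberately produces invalid input that a conforming decoder should reject, and all_pairs enumerates the ordered encoding pairs for the optional any-to-any test.

```python
from itertools import permutations


def check_scalar(cp):
    """Reject values outside the assumed code space and surrogates."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")


def utf8_encode(cp):
    """Shortest-form UTF-8 encoding of one code point."""
    check_scalar(cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16be_encode(cp):
    """UTF-16BE without BOM; uses a surrogate pair above U+FFFF."""
    check_scalar(cp)
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


def ucs4be_encode(cp):
    """UCS-4BE without BOM: the code point as a 32-bit big-endian integer."""
    check_scalar(cp)
    return cp.to_bytes(4, "big")


def overlong_utf8(cp, n):
    """Encode cp in n bytes (2..4) although a shorter form exists.

    The result is intentionally malformed test input.
    """
    if not 2 <= n <= 4 or len(utf8_encode(cp)) >= n:
        raise ValueError("n must be 2..4 and longer than the shortest form")
    lead = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]
    out = [lead | (cp >> (6 * (n - 1)))]
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


def all_pairs(names):
    """All ordered (from, to) encoding pairs with from != to."""
    return list(permutations(names, 2))
```

For example, overlong_utf8(0x2F, 2) yields the bytes C0 AF, an overlong form of "/" that a conforming UTF-8 decoder should reject rather than convert to U+002F.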
It should take a programmer with a good Unicode background less than 5 days
to implement a first release. I'd love to do it but really don't have the
time right now ... :-( any volunteers?
Useful references and previous closely related work:
http://www.opengroup.org/onlinepubs/007908799/xsh/iconv.h.html
http://clisp.cons.org/~haible/packages-libiconv.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
http://www.cl.cam.ac.uk/~mgk25/unicode.html#conv
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Any volunteers?
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/