Re: Expat XML Parser Full Character Encoding Support
Michael B. Allen writes:
> The popular Expat XML parser does not have built in support for handling
> character sets other than UTF-8, UTF-16, ISO-8859-1, and ASCII.
This is usually sufficient. I've never seen an XML file in anything
other than ISO-8859-1 or UTF-8.
> For example, for EUC-JP, I think I would have to populate the map with
> the ASCII character set, put -2 in the 80 to FF range,
This is not correct. EUC-JP also has 3-byte sequences.
0x80..0x8D -> -1
0x8E -> -2
0x8F -> -3
0x90..0xA0 -> -1
0xA1..0xFE -> -2
0xFF -> -1
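
A minimal sketch of loading those values into a 256-entry lead-byte map of
the kind Expat's XML_Encoding structure carries. The function name
fill_eucjp_map is made up for illustration, and the sketch assumes that
0x00..0x7F map to themselves and that every high byte not named as a lead
byte (including 0x90..0xA0) is invalid:

```c
#include <stddef.h>

/* Fill a lead-byte map for EUC-JP in the style of Expat's XML_Encoding.map:
   a non-negative entry is the code point of a single-byte character,
   -1 marks an invalid lead byte, and -n marks the first byte of an
   n-byte sequence. fill_eucjp_map is a hypothetical helper name. */
void fill_eucjp_map(int map[256])
{
    int b;

    for (b = 0x00; b <= 0x7F; b++) map[b] = b;   /* ASCII maps to itself */
    for (b = 0x80; b <= 0xFF; b++) map[b] = -1;  /* default: invalid lead byte */
    map[0x8E] = -2;                              /* SS2 followed by 1 byte */
    map[0x8F] = -3;                              /* SS3 followed by 2 bytes */
    for (b = 0xA1; b <= 0xFE; b++) map[b] = -2;  /* lead byte of a 2-byte char */
}
```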
> and assuming the platform is __STDC_ISO_10646__ I would use a
> wrapper convert function to the euc_jp_mbtowc function from
> libiconv.
You don't need __STDC_ISO_10646__, because although the function is
called euc_jp_mbtowc, it doesn't use the wchar_t type. Also, consider
using the iconv() function itself, so that on Linux you can use the
one in glibc.
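
One way the iconv() route could look as a convert callback is sketched
here. The struct iconv_ctx and the name iconv_convert are invented for
illustration. The sketch assumes the converter was opened with
iconv_open("UCS-4LE", name), that the iconv implementation in use knows
the name "UCS-4LE", and that the callback has the shape
int (*)(void *data, const char *s) with -1 meaning "malformed":

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical per-encoding state handed back as the 'data' argument. */
struct iconv_ctx {
    iconv_t cd;       /* assumed: iconv_open("UCS-4LE", <encoding name>) */
    int     map[256]; /* copy of the lead-byte map, -n for n-byte sequences */
};

/* Convert the multibyte sequence starting at s to a Unicode code point.
   Returns -1 if the sequence is not a single valid character. */
int iconv_convert(void *data, const char *s)
{
    struct iconv_ctx *ctx = (struct iconv_ctx *)data;
    int n = -ctx->map[(unsigned char)s[0]];  /* sequence length in bytes */
    unsigned char out[4];
    char *in = (char *)s;                    /* iconv() takes char ** */
    char *outp = (char *)out;
    size_t inleft, outleft = sizeof out;
    unsigned long cp;

    if (n < 2 || n > 4)
        return -1;
    inleft = (size_t)n;
    iconv(ctx->cd, NULL, NULL, NULL, NULL);  /* reset to the initial state */
    if (iconv(ctx->cd, &in, &inleft, &outp, &outleft) == (size_t)-1)
        return -1;
    if (inleft != 0 || outleft != 0)         /* want exactly one code point */
        return -1;
    cp = (unsigned long)out[0] | ((unsigned long)out[1] << 8)
       | ((unsigned long)out[2] << 16) | ((unsigned long)out[3] << 24);
    return (int)cp;
}
```

In an unknown-encoding handler, the context would go into the data member
of XML_Encoding and the function into its convert member, with a release
function that calls iconv_close().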
> My question is, can I create such a handler that builds against the
> libiconv sources that does not require semantic information about each
> encoding?
Yes, you can mechanically extract the needed information by calling
iconv() once for every possible input byte sequence.
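
A rough sketch of such a mechanical extraction follows. The names probe,
find_length and build_map are invented for illustration. It assumes a
stateless encoding in which the lead byte alone determines the sequence
length, an iconv implementation that knows "UCS-4LE", and the POSIX
behaviour of reporting an incomplete input sequence with EINVAL. Rather
than trying every sequence, it stops at the first complete character
found for each lead byte:

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Feed seq[0..n-1] to a freshly reset converter. Returns 1 if the bytes
   form exactly one character (code point in *cp), 0 if they are an
   incomplete prefix, -1 if they are invalid. */
static int probe(iconv_t cd, const unsigned char *seq, size_t n, long *cp)
{
    unsigned char out[8];
    char *in = (char *)seq, *outp = (char *)out;
    size_t inleft = n, outleft = sizeof out;

    iconv(cd, NULL, NULL, NULL, NULL);
    if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1)
        return errno == EINVAL ? 0 : -1;
    if (inleft != 0 || outleft != sizeof out - 4)
        return -1;                     /* not exactly one code point */
    *cp = (long)out[0] | ((long)out[1] << 8)
        | ((long)out[2] << 16) | ((long)out[3] << 24);
    return 1;
}

/* Depth-first search for the first complete character that starts with
   seq[0..n-1]; returns its length in bytes (at most 4) or -1. */
static int find_length(iconv_t cd, unsigned char *seq, size_t n, long *cp)
{
    int t, r = probe(cd, seq, n, cp);

    if (r == 1) return (int)n;
    if (r < 0 || n == 4) return -1;
    for (t = 0; t < 256; t++) {
        seq[n] = (unsigned char)t;
        r = find_length(cd, seq, n + 1, cp);
        if (r > 0) return r;
    }
    return -1;
}

/* Fill map[] with the code point for single bytes, -n for the first byte
   of an n-byte sequence and -1 for invalid lead bytes. Returns -1 if the
   encoding cannot be opened. */
int build_map(const char *encoding, int map[256])
{
    iconv_t cd = iconv_open("UCS-4LE", encoding);
    unsigned char seq[4];
    long cp = 0;
    int b, n;

    if (cd == (iconv_t)-1) return -1;
    for (b = 0; b < 256; b++) {
        seq[0] = (unsigned char)b;
        n = find_length(cd, seq, 1, &cp);
        map[b] = (n == 1) ? (int)cp : (n > 1) ? -n : -1;
    }
    iconv_close(cd);
    return 0;
}
```

A call such as build_map("EUC-JP", map) would then produce the map for
EUC-JP, provided the local iconv supports that name.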
> Is there a way to determine how many bytes will be needed to
> represent each character in a character set?
Yes, just take a look at the conversion tables, e.g. in
libiconv/tests/*.TXT.
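
As an illustration, the byte count per lead byte could be read off such a
table roughly as follows. This is only a sketch: it assumes each data line
starts with the byte sequence written as one hexadecimal number, as in
"0x8EA1<TAB>0xFF61", and the function names parse_table_line and
lengths_from_table are invented here:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Parse one table line of the assumed form "0x8EA1<TAB>0xFF61".
   On success stores the lead byte and the sequence length in bytes and
   returns 0; returns -1 for comments, blank lines and malformed lines. */
int parse_table_line(const char *line, int *lead, int *nbytes)
{
    size_t digits = 0;
    unsigned int first;

    if (line[0] != '0' || (line[1] != 'x' && line[1] != 'X'))
        return -1;
    line += 2;
    while (isxdigit((unsigned char)line[digits]))
        digits++;
    if (digits < 2 || digits % 2 != 0)
        return -1;
    if (sscanf(line, "%2x", &first) != 1)    /* first two hex digits */
        return -1;
    *lead = (int)first;
    *nbytes = (int)(digits / 2);
    return 0;
}

/* Read a whole table and record the sequence length seen for each lead
   byte (0 means the byte never occurs as a lead byte). Returns -1 if a
   lead byte occurs with two different lengths, else 0. */
int lengths_from_table(FILE *fp, int len[256])
{
    char line[256];
    int lead, n;

    memset(len, 0, 256 * sizeof len[0]);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (parse_table_line(line, &lead, &n) != 0)
            continue;
        if (len[lead] != 0 && len[lead] != n)
            return -1;
        len[lead] = n;
    }
    return 0;
}
```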
> Can I dynamically generate this information with Markus Kuhn's perl
> tools or by some other means?
If you want it to be slow, you can certainly use perl for that
purpose.
Bruno
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/