Re: 3.2 MAPPINGS/EASTASIA
On Sat, 30 Mar 2002, Gaspar Sinai wrote:
> I noticed that at ftp.unicode.org /Public/MAPPINGS/EASTASIA
> has been moved to OBSOLETE directory. README.TXT reads:
>
> The entire former contents of this directory are obsolete
> and have been moved to the OBSOLETE directory. The latest
> information may be found in the Unihan.txt file in the latest
> Unicode Character Database.
> August 1, 2001.
>
> I looked at Unihan.txt file but I found no way to extract
> GB2312.TXT JIS0208.TXT JIS0212.TXT KSC5601.TXT (KSX1001.TXT?)
> OLD5601.TXT and JIS0201.TXT files.
KSC5601.TXT in OBSOLETE/EASTASIA is NOT the mapping between Unicode
and KS C 5601-1987; it is the mapping between MS CP949 and Unicode
(without the US-ASCII portion). OLD5601.TXT is the mapping between
KS C 5601-1987 (KS C 5601-1992 and KS X 1001:1997) and Unicode 2.0.
So is KSX1001.TXT.
> For instance:
> JIS0201.TXT:
> 0xB1 0xFF71 # HALFWIDTH KATAKANA LETTER A
>
> cd Public/UNIDATA
> grep -i FF71 *.* | grep -i B1
>
> proves that neither Unihan.txt nor any of the other UNIDATA
> files can be used to generate JIS0201.TXT.
>
> The question is: What is the best source for these maps?
> Is there a place where they are centrally maintained?
You can extract two different mappings between EUC-KR and Unicode
from CP949.TXT (in VENDORS/MICSFT/) and KOREAN.TXT (in VENDORS/APPLE).
Just filter out the non-EUC portion and keep only the EUC codepoints
(that is, 0x00-0x7E for single-byte characters and
[0xA1-0xFE][0xA1-0xFE] for double-byte characters). If you want the
mapping between KS X 1001 and Unicode, subtract 0x8080 from the
codepoints of the two-byte characters in EUC-KR. I've put them up
at
http://jshin.net/faq/KSX1001.TXT.gz (extracted from CP949.TXT)
http://jshin.net/faq/JOHAB.TXT.gz (for Johab)
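
The filtering described above can be sketched roughly as follows.
This is only an illustration, not the script used to build the files
above. It assumes the usual "0xXXXX<TAB>0xYYYY<TAB>#comment" layout
of CP949.TXT; the function names are made up for this example, and
lines whose Unicode column is empty or is not a single hex value are
simply skipped.

```python
def parse_mapping(lines):
    """Yield (legacy_code, unicode_code_point) pairs from a mapping table."""
    for line in lines:
        # Drop the trailing comment, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment-only line, or no Unicode value given
        try:
            yield int(fields[0], 16), int(fields[1], 16)
        except ValueError:
            continue  # not a plain hex pair; ignored in this sketch


def is_euc_kr(code):
    """True if code lies in the EUC portion as described in this message."""
    if code <= 0xFF:
        return code <= 0x7E  # single-byte range as given above
    lead, trail = code >> 8, code & 0xFF
    return code <= 0xFFFF and 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


def euc_kr_pairs(lines):
    """Keep only the EUC-KR entries of a CP949-style table."""
    return [(c, u) for c, u in parse_mapping(lines) if is_euc_kr(c)]


def ksx1001_pairs(lines):
    """Two-byte EUC-KR entries shifted down by 0x8080 to KS X 1001 codes."""
    return [(c - 0x8080, u) for c, u in euc_kr_pairs(lines) if c > 0xFF]
```

To try it on a local copy, something like
ksx1001_pairs(open("CP949.TXT", encoding="ascii", errors="replace"))
should do; the file name and location are whatever you downloaded.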
The differences between the two mappings are well explained in Apple's
Korean mapping table, KOREAN.TXT
(ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/KOREAN.TXT).
Another difference is that Apple's Korean mapping doesn't have the two
new characters added to KS X 1001 in December 1998: EURO SIGN
(U+20AC) at row 2 column 70 (0xA2E6 in EUC-KR and 0x2266 in ISO-2022-KR)
and REGISTERED SIGN (U+00AE) at row 2 column 71 (0xA2E7 in EUC-KR and
0x2267 in ISO-2022-KR). Glibc and libiconv have already added them.
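
For reference, the arithmetic behind those code values can be checked
with a few lines of Python. The helper name is invented for this
example: a row/column pair becomes a 7-bit code (as used in
ISO-2022-KR) by adding 0x20 to each of row and column, and the EUC-KR
code is that value plus 0x8080.

```python
def kuten_to_codes(row, col):
    """Return (7-bit code, EUC-KR code) for a KS X 1001 row/column pair."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must each be in 1..94")
    seven_bit = ((row + 0x20) << 8) | (col + 0x20)
    return seven_bit, seven_bit + 0x8080


# EURO SIGN at row 2 column 70, REGISTERED SIGN at row 2 column 71
for row, col in [(2, 70), (2, 71)]:
    seven_bit, euc = kuten_to_codes(row, col)
    print(f"row {row} col {col}: 0x{seven_bit:04X} 0x{euc:04X}")
```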
Jungshik Shin
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/