
ASCII and JIS X 0201 Roman - the backslash problem



Hi all,

Tomohiro Kubota, in
http://www.debian.or.jp/~kubota/unicode-symbols-yen.html, explains
the YEN SIGN versus REVERSE SOLIDUS problem.  He writes:

  "Solution is very simple. Just regard YEN SIGN and REVERSE SOLIDUS
   as a different glyphs of the same character. Then, distinction
   between ASCII and JIS X 0201 Roman can be neglected."

I don't think this is a good solution. It would never allow Japanese users
to use the same fonts for ASCII as users elsewhere.

To make it possible for Japanese users to work in a UTF-8 locale, we
need the following:

1) Admit that YEN SIGN and REVERSE SOLIDUS are different things.

2) Never use backslash as a directory separator.

3) For programs that interpret backslash as some kind of escape character
   and use Unicode internally but must also work with text in Shift_JIS
   encoding, treat the single-byte multibyte character 0x5C as the escape
   trigger, not [only] the Unicode character U+005C. This is already done
   in bash and gettext. For example, in GNU gettext, we have the code

static bool
mb_iseq (mbc, sc)
     const mbchar_t mbc;
     char sc;
{
  /* Note: It is wrong to compare only mbc->uc, because when the encoding is
     SHIFT_JIS, mbc->buf[0] == '\\' corresponds to mbc->uc == 0x00A5, but we
     want to treat it as an escape character, although it looks like a Yen
     sign.  */
#if HAVE_ICONV && 0
  if (mbc->uc_valid)
    return (mbc->uc == sc); /* wrong! */
  else
#endif
    return (mbc->bytes == 1 && mbc->buf[0] == sc);
}

4) When people convert files from Shift_JIS to Unicode, they need to
   disambiguate the two uses of the character that Tomohiro mentions:
   "When a Japanese person is a writer, it means YEN SIGN in most cases.
    When a non-Japanese person is a writer, it always means REVERSE SOLIDUS."
   These "most cases" need to be distinguished by someone who knows the
   content: in a financial text the use is likely different from that in
   a shell script. The iconv program cannot do this, because it sees only
   the byte 0x5C and not the writer's intent.
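
To make points 3 and 4 concrete, here is a minimal sketch. It is not the
gettext code: struct mbchar_sketch, is_escape_trigger and convert_byte5c
are hypothetical names invented for this example, and the struct is only
a simplified stand-in for a decoded multibyte character. The first
function decides "escape or not" from the original byte; the second shows
that the target code point for byte 0x5C has to be supplied by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a decoded multibyte character:
   the original bytes plus the Unicode value the converter assigned.  */
struct mbchar_sketch
{
  size_t bytes;          /* number of bytes in buf */
  unsigned char buf[4];  /* the bytes as they appear in the file */
  unsigned int uc;       /* Unicode code point after conversion */
};

/* Point 3: look at the byte, not at uc, so the answer is the same
   whether the converter mapped 0x5C to U+005C or to U+00A5.  The
   bytes == 1 test keeps a 0x5C trail byte of a two-byte Shift_JIS
   character from being mistaken for an escape.  */
bool
is_escape_trigger (const struct mbchar_sketch *mbc)
{
  return mbc->bytes == 1 && mbc->buf[0] == 0x5C;
}

/* Point 4: the meaning of byte 0x5C is a policy chosen by a person
   (or per kind of file), passed in explicitly by the caller.  */
enum byte5c_meaning { MEANS_REVERSE_SOLIDUS, MEANS_YEN_SIGN };

unsigned int
convert_byte5c (enum byte5c_meaning meaning)
{
  return meaning == MEANS_YEN_SIGN ? 0x00A5 : 0x005C;
}
```

With this, a single byte 0x5C whose uc came out as 0x00A5 is still an
escape trigger, while the two-byte Shift_JIS sequence 0x83 0x5C
(KATAKANA LETTER SO) is not. A converter for shell scripts would pass
MEANS_REVERSE_SOLIDUS, one for price lists MEANS_YEN_SIGN.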

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/