Re: UTF-8 curses
Thanks for the URLs. I found out from
http://www.UNIX-systems.org/online.html that my use of addch to send a
single octet of a multibyte character is wrong, so I'm changing my
code.
Although the UNIX specification is quite detailed on how multibyte and
wide characters are handled, I'm still very unclear about locale
issues and how to get into and out of UTF-8 mode, so I'll present a
litany of questions, proceeding from the application to the terminal:
Does the C library provide a way of converting to and from UTF-8?
Unfortunately, mbtowc and wctomb use a locale-dependent multibyte
representation, which is no good for processing e-mail, and I probably
don't want to reset the locale, either. Never mind, I can reimplement
those functions anyway. (There's the same problem with isspace,
isupper, etc; ctype is locale-dependent, so I can't use those
functions for parsing RFC-822 headers, etc.) It would be nice,
however, if I could avoid unnecessary conversion in the case where I
receive a string in e-mail in UTF-8 and want to send it to a library
that expects a locale-dependent multibyte string, and the locale is in
fact UTF-8: in that case I shouldn't have to convert twice.
(On the other hand, I might want to check the UTF-8 is valid anyway,
so perhaps this isn't a problem.)
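
For illustration, here is a minimal sketch of what a locale-independent
reimplementation might look like. The names utf8_decode and utf8_valid
are hypothetical (they are not in any C library), and the sketch assumes
the 4-byte form of UTF-8 limited to U+10FFFF, rejecting overlong forms
and surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n bytes)
 * into *cp without consulting the locale. Returns the number of bytes
 * consumed, or 0 if the input is empty, truncated, or not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c, min;
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or bad lead byte */
    }

    if (n < len)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min)
        return 0;                           /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                           /* surrogate code point */
    if (c > 0x10FFFF)
        return 0;                           /* beyond the assumed upper limit */

    *cp = c;
    return len;
}

/* Hypothetical helper: return 1 if the n bytes at s are entirely valid
 * UTF-8 under the assumptions above, 0 otherwise. */
int utf8_valid(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned long cp;
    size_t used;

    assert(s != NULL);
    while (n > 0) {
        used = utf8_decode(p, n, &cp);
        if (used == 0)
            return 0;
        p += used;
        n -= used;
    }
    return 1;
}
```

A validity check of this kind would let a string received in e-mail be
passed through unconverted when the locale is already UTF-8.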
The UNIX spec says curses addstr takes a multibyte string (which may
or may not be UTF-8). This means that it's not compatible with the
traditional 8-bit curses. So I'm supposed to have two versions of the
library on my system and link to the appropriate one, am I? I don't
think I have a problem with that, apart from it being a waste of
memory, but I want to know whether there is any point in implementing
a function to switch curses into and out of UTF-8 mode.
For simplicity, let's assume curses implements all or none of the UNIX
spec, so we don't have to worry about how curses tells the application
that it can do double-width chars, but not non-spacing chars, etc. We
can also assume that both curses and the application are linked
against the same C library so there is no danger of disagreeing about
wcwidth. Except my C library doesn't have wcwidth, so I would be
interested in an official definition, if anyone has one. (I saw that
Markus has a function on his web page to identify double-width chars,
but I want to know about non-spacing chars as well.)
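
To make the question concrete, this is roughly the shape of function I
have in mind. It is only a sketch, not an official definition: the name
sketch_wcwidth is hypothetical, the range tables are deliberately tiny
and incomplete, and the return convention (0 for the null character, -1
for control characters) is my reading of how wcwidth is specified.

```c
#include <assert.h>
#include <stddef.h>

struct cp_range { unsigned long first, last; };

/* Illustrative, incomplete table of non-spacing characters. */
static const struct cp_range nonspacing[] = {
    { 0x0300, 0x036F },   /* Combining Diacritical Marks block */
    { 0x20D0, 0x20FF },   /* Combining Diacritical Marks for Symbols block */
};

/* Illustrative, incomplete table of double-width characters. */
static const struct cp_range wide[] = {
    { 0x1100, 0x115F },   /* Hangul Jamo leading consonants */
    { 0x4E00, 0x9FFF },   /* CJK Unified Ideographs */
    { 0xAC00, 0xD7A3 },   /* Hangul Syllables */
    { 0xFF01, 0xFF60 },   /* Fullwidth forms */
};

/* Linear search; a real table would be long enough to want bsearch. */
static int in_ranges(unsigned long cp, const struct cp_range *table, size_t count)
{
    size_t i;

    assert(table != NULL);
    for (i = 0; i < count; i++)
        if (cp >= table[i].first && cp <= table[i].last)
            return 1;
    return 0;
}

/* Hypothetical stand-in for wcwidth: number of columns for code point cp. */
int sketch_wcwidth(unsigned long cp)
{
    if (cp == 0)
        return 0;                                   /* null character */
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return -1;                                  /* control characters */
    if (in_ranges(cp, nonspacing, sizeof nonspacing / sizeof nonspacing[0]))
        return 0;                                   /* non-spacing */
    if (in_ranges(cp, wide, sizeof wide / sizeof wide[0]))
        return 2;                                   /* double-width */
    return 1;                                       /* everything else */
}
```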
Unfortunately we cannot assume that the terminal implements all or
none of Unicode. (The Linux console, for example, behaves as if
wcwidth were always 1.) How can curses tell whether the terminal understands
double-width and non-spacing characters, or, indeed, whether the
terminal understands UTF-8 at all? Also, is there a danger of them
disagreeing about which characters are double-width or non-spacing? If
there is confusion, more than just the character concerned will be
affected because curses will become deluded about the position of the
cursor.
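
A small sketch of the disagreement I am worried about, under invented
assumptions: width_always_one models a terminal like the Linux console,
width_cjk_double models a curses that treats CJK ideographs as
double-width, and column_after and demo_text are hypothetical names
used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*width_fn)(unsigned long cp);

/* Models a terminal that gives every character exactly one column. */
int width_always_one(unsigned long cp)
{
    (void)cp;
    return 1;
}

/* Models a library that treats CJK Unified Ideographs as double-width
 * and everything else as single-width (a simplification). */
int width_cjk_double(unsigned long cp)
{
    return (cp >= 0x4E00 && cp <= 0x9FFF) ? 2 : 1;
}

/* Column the cursor is believed to be in after writing n code points
 * starting from column 0, according to the given width function. */
int column_after(const unsigned long *text, size_t n, width_fn width)
{
    int col = 0;
    size_t i;

    assert(text != NULL && width != NULL);
    for (i = 0; i < n; i++)
        col += width(text[i]);
    return col;
}

/* Illustrative text: 'A', U+4E2D (a CJK ideograph), 'B'. */
const unsigned long demo_text[3] = { 0x41, 0x4E2D, 0x42 };
```

With demo_text, the library model puts the cursor in column 4 while the
terminal model puts it in column 3, so everything written afterwards on
that line would be off by one.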
Finally, is there any way an application or library might be able to
tell whether the terminal is capable of displaying a particular
character (in the current font)? Lynx, for example, already contains
tables for stripping diacritics when the character with diacritic is
not in the current 8-bit display charset, but all this is lost when
the display charset is utf-8. So, ironically, changing the display
charset to utf-8 can make a web page less readable, not more.
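
For concreteness, a diacritic-stripping table of the kind I mean could
be as simple as the sketch below. This is not Lynx's actual table: the
name ascii_fallback and the handful of entries are made up for
illustration, and deciding when the fallback should be applied is
exactly the open question.

```c
#include <assert.h>
#include <stddef.h>

struct fallback {
    unsigned long cp;     /* Unicode code point */
    const char *ascii;    /* replacement with the diacritic stripped */
};

/* Illustrative entries only; a real table would be far larger. */
static const struct fallback fallbacks[] = {
    { 0x00C9, "E" },   /* E with acute */
    { 0x00E0, "a" },   /* a with grave */
    { 0x00E9, "e" },   /* e with acute */
    { 0x00F1, "n" },   /* n with tilde */
    { 0x00FC, "u" },   /* u with diaeresis */
};

/* Hypothetical helper: return the ASCII fallback for cp, or dflt if the
 * table has no entry for it. */
const char *ascii_fallback(unsigned long cp, const char *dflt)
{
    size_t i;

    assert(dflt != NULL);
    for (i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++)
        if (fallbacks[i].cp == cp)
            return fallbacks[i].ascii;
    return dflt;
}
```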
Edmund
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/