
Re: UTF-8 curses



Edmund Grimley Evans writes:

> I found out from http://www.UNIX-systems.org/online.html that my use
> of addch to send a single octet of a multibyte character is wrong,

Yes. All C APIs which deal with multibyte strings handle at least one
whole character at a time. "Pieces of multibyte characters" are not
normally used.

> I'm still very unclear about locale issues and how to get into and
> out of UTF-8 mode

The standardized APIs don't know about a "UTF-8 mode". The major part
of the ISO C APIs is concerned with making your programs work correctly
in the locale which has been set by the user (i.e. depending on the
LC_CTYPE environment variable).

Only one function can be used as a bridge between different locales:
iconv (and on systems where you don't have it: librecode).

For applications which need to deal with only one character set, the
ISO C APIs are sufficient for most simple purposes. Email clients and
browsers, however, have to obey the charset tags, and therefore need to
resort to iconv.

> Does the C library provide a way of converting to and from UTF-8?

Yes. Use iconv. Pass it "UTF-8" as one of the two charsets, and determine
the other one from the locale (using code like [1]). In glibc, instead of
"UTF-8" you can also use "UNICODEBIG" or "UNICODELITTLE", which mean
16-bit UCS-2 in big-endian and little-endian byte order, respectively.
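
A minimal sketch of such a conversion, for illustration only. The helper
names convert_string and locale_to_utf8 are hypothetical, and
nl_langinfo(CODESET) is used here as an assumed stand-in for the
locale_charset code in [1]; it only gives a useful answer after
setlocale(LC_CTYPE, "") has been called.

```c
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from charset
   `from` to charset `to`, writing a NUL-terminated result into `out`
   (capacity `outsize`). Returns 0 on success, -1 on error with errno set
   by iconv_open or iconv (EILSEQ, EINVAL, E2BIG). */
static int convert_string(const char *from, const char *to,
                          const char *in, char *out, size_t outsize)
{
    if (outsize == 0) {
        errno = E2BIG;
        return -1;
    }

    iconv_t cd = iconv_open(to, from);   /* argument order: (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;            /* iconv takes char **, input is not modified */
    size_t inleft = strlen(in);
    char *outptr = out;
    size_t outleft = outsize - 1;        /* keep one byte for the terminating NUL */

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);  /* flush any shift state */

    int saved = errno;
    iconv_close(cd);
    if (rc == (size_t)-1) {
        errno = saved;
        return -1;
    }
    *outptr = '\0';
    return 0;
}

/* Hypothetical wrapper: convert from the charset of the current locale to
   UTF-8. Assumes the caller has already done setlocale(LC_CTYPE, ""). */
static int locale_to_utf8(const char *in, char *out, size_t outsize)
{
    return convert_string(nl_langinfo(CODESET), "UTF-8", in, out, outsize);
}
```

Swapping the two charset arguments gives the opposite direction, from
UTF-8 to the locale's charset.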

> Unfortunately, mbtowc and wctomb use a locale-dependent multibyte
> representation, which is no good for processing e-mail, and I probably
> don't want to reset the locale, either. Never mind, I can reimplement
> those functions anyway.

I'd suggest one of the following approaches:

- Do a setlocale(LC_CTYPE,"en_US.UTF-8") once and for all at the start
  of your program. You can then assume that mbtowc/wctomb deals with
  UTF-8 and Unicode, and use iconv to get the data into that format.
  But this will likely not work well if you use curses, because it
  will trick curses into thinking it is running in a UTF-8 xterm,
  when in fact it is not.

- Keep running in the user's locale, and convert the data to the user's
  current locale format using iconv. Be prepared to deal with EILSEQ here.
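
One possible way to deal with EILSEQ in the second approach, as a sketch
only: skip the offending UTF-8 character and emit a placeholder. The
function name convert_lossy and the choice of '?' as the replacement are
assumptions, and the code assumes the target charset is stateless and
ASCII-compatible. Some iconv implementations substitute a character of
their own instead of reporting EILSEQ, so the placeholder may differ.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert UTF-8 text `in` to charset `to`, replacing
   each character that iconv rejects with EILSEQ by '?'.
   Returns 0 on success, -1 on any other failure (bad charset name, output
   buffer too small, truncated input). `out` is NUL-terminated whenever
   outsize > 0 and the charset could be opened. */
static int convert_lossy(const char *to, const char *in,
                         char *out, size_t outsize)
{
    if (outsize == 0)
        return -1;

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;
    size_t inleft = strlen(in);
    char *outptr = out;
    size_t outleft = outsize - 1;        /* room for the terminating NUL */
    int status = 0;

    while (inleft > 0) {
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) != (size_t)-1)
            break;                       /* everything was converted */
        if (errno != EILSEQ || outleft == 0) {
            status = -1;                 /* E2BIG, EINVAL, or no room left */
            break;
        }
        /* Skip the rejected character: its first byte, then any UTF-8
           continuation bytes (10xxxxxx) that follow it. */
        do {
            inptr++;
            inleft--;
        } while (inleft > 0 && ((unsigned char)*inptr & 0xC0) == 0x80);
        *outptr++ = '?';
        outleft--;
    }

    iconv_close(cd);
    *outptr = '\0';
    return status;
}
```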

Given that glibc-2.2 will contain support for all this, I see no point
in spending time to reimplement mbtowc/wctomb yet another time.

> The UNIX spec says curses addstr takes a multibyte string (which may
> or may not be UTF-8). This means that it's not compatible with the
> traditional 8-bit curses.

Sure it is compatible. When LC_CTYPE denotes an 8-bit character set,
every multibyte character is exactly one byte. The "multibyte string"
notion is backward compatible with the old char* "string".
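
To illustrate the point, here is a small sketch (the name count_chars is
made up for this example) that walks a multibyte string one character at
a time with mbrtowc. In a locale with an 8-bit character set each step
consumes exactly one byte, so the count equals strlen; in a UTF-8 locale
a step may consume several bytes.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters of the multibyte string `s`
   as interpreted in the current LC_CTYPE locale. Returns (size_t)-1 if
   the string contains an invalid or incomplete multibyte sequence. */
static size_t count_chars(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);           /* start in the initial shift state */

    size_t count = 0;
    size_t left = strlen(s);
    while (left > 0) {
        size_t len = mbrtowc(NULL, s, left, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;           /* invalid or truncated character */
        if (len == 0)
            break;                       /* NUL reached; not expected within strlen(s) bytes */
        s += len;
        left -= len;
        count++;
    }
    return count;
}
```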

> Except my C library doesn't have wcwidth, so I would be
> interested in an official definition, if anyone has one.

You can take the one from [2]. It implements Markus' function, plus
it knows about non-spacing characters.
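
For orientation only, a heavily simplified sketch of what such a function
looks like. This is not the code from [2]: the name sketch_wcwidth is
invented, and the range tables are deliberately incomplete (only one
block of combining characters and the main East Asian wide ranges), so
treat it as an illustration of the return-value convention rather than
a usable definition.

```c
/* Simplified illustration of the wcwidth convention:
     -1  control character (not printable)
      0  NUL or a non-spacing (combining) character
      2  double-width (East Asian wide) character
      1  everything else
   The tables below are intentionally incomplete. */
static int sketch_wcwidth(unsigned long ucs)
{
    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0))
        return -1;                                 /* C0/C1 controls and DEL */

    if (ucs >= 0x0300 && ucs <= 0x036f)
        return 0;                                  /* Combining Diacritical Marks */

    if ((ucs >= 0x1100 && ucs <= 0x115f) ||        /* Hangul Jamo initial consonants */
        (ucs >= 0x2e80 && ucs <= 0xa4cf &&
         ucs != 0x303f) ||                         /* CJK ... Yi */
        (ucs >= 0xac00 && ucs <= 0xd7a3) ||        /* Hangul Syllables */
        (ucs >= 0xf900 && ucs <= 0xfaff) ||        /* CJK Compatibility Ideographs */
        (ucs >= 0xfe30 && ucs <= 0xfe6f) ||        /* CJK Compatibility Forms */
        (ucs >= 0xff00 && ucs <= 0xff60) ||        /* Fullwidth Forms */
        (ucs >= 0xffe0 && ucs <= 0xffe6))
        return 2;

    return 1;
}
```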

> The Linux console has wcwidth always 1, for example.

It would make sense to change that, if the kernel could use big fonts.
The frame-buffer device has no 512-chars limit, right?

> Also, is there a danger of them disagreeing about which characters
> are double-width or non-spacing? If there is confusion, more than
> just the character concerned will be affected because curses will
> become deluded about the position of the cursor.

You could program curses in such a way that it could cope with both
kinds of terminal settings: after outputting a double-width character,
it would output a space too and then a cursor-positioning sequence.
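
A rough sketch of that idea, building the byte sequence in a buffer. The
function name emit_wide is hypothetical, and the hard-coded ESC [ row ;
col H sequence assumes a VT100/ANSI-style terminal; a real curses would
take the cursor-positioning string from terminfo instead. Note that on a
terminal which does treat the character as double-width, the extra space
lands in the cell after the character, so that cell has to be drawn by
whatever is output next.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format the output for one double-width character
   `mbchar` (a multibyte string) placed at 0-based screen position
   (row, col). The character is followed by a space and then an absolute
   cursor move to column col + 2, so the cursor ends up in the same place
   whether the terminal advanced by one or by two columns.
   Returns the snprintf result (bytes needed, excluding the NUL). */
static int emit_wide(char *buf, size_t size, const char *mbchar,
                     int row, int col)
{
    /* ANSI cursor positions are 1-based, hence the "+ 1". */
    return snprintf(buf, size, "%s \033[%d;%dH",
                    mbchar, row + 1, col + 2 + 1);
}
```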

> Finally, is there any way an application or library might be able to
> tell whether the terminal is capable of displaying a particular
> character (in the current font)?

Not that I know of. The typical way is to look at the charset of the
locale ([1]) and then use iconv to convert the character and see if
you get errno = EILSEQ.
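
Such a check could look roughly like this. The function name
representable is made up for the example, and the charset argument would
come from the locale, e.g. via the code in [1]. Keep in mind that this
only tells you whether the locale's charset contains the character, not
whether the terminal font has a glyph for it, and that EILSEQ is also
reported for malformed input.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: test whether one UTF-8 encoded character can be
   converted to `charset`.
   Returns 1 if it converts cleanly, 0 if iconv reports EILSEQ or had to
   substitute a different character, -1 on any other error. */
static int representable(const char *utf8_char, const char *charset)
{
    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char out[16];
    char *inptr = (char *)utf8_char;
    size_t inleft = strlen(utf8_char);
    char *outptr = out;
    size_t outleft = sizeof out;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    int saved = errno;
    iconv_close(cd);

    if (rc == (size_t)-1)
        return saved == EILSEQ ? 0 : -1;
    /* A nonzero count means the implementation made a non-reversible
       substitution instead of failing. */
    return rc == 0 ? 1 : 0;
}
```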

                      Bruno

[1] ftp://ftp.ilog.fr/pub/Users/haible/utf8/locale_charset.c
[2] ftp://ftp.ilog.fr/pub/Users/haible/utf8/libutf8-0.5.2.tar.gz


-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/