
Re: Testing for UTF-8 tty mode



Andries.Brouwer@cwi.nl wrote on 1999-09-15 20:34 UTC:
> Today we have two distinct Unicode modes. Bruno adds a third one.
> Is that a good idea? Maybe.

I never liked the idea that we have two different mechanisms in the
Linux console to activate UTF-8 mode, and I still consider this to be
just a historical accident:

  - ESC % G to activate it in the part that sends characters to the screen
  - ioctl() to activate it in the part that processes the keystrokes

Terminal emulators (Kermit, xterm, etc.) do both with ESC % G, because
ioctl() does not fit the serial-cable model of the terminal world very
well. I have also never been able to imagine an application (not even
for debugging, really) where it comes in handy to have the keyboard but
not the screen in UTF-8 mode, or vice versa. That situation usually only
comes up temporarily when one side has not yet been implemented.
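
For illustration, the two mechanisms could be driven from user space
roughly as follows. This is only a sketch: set_utf8_mode() and
screen_sequence() are hypothetical helper names, the constants are
copied from <linux/kd.h>, and the ioctl() part only works on a real
Linux virtual console, so it is skipped unless a tty file descriptor is
passed in.

```python
import fcntl
import sys

# ISO 2022 coding-system switches understood by the Linux console
# and by terminal emulators such as xterm.
UTF8_ON = b"\x1b%G"   # ESC % G: select UTF-8
UTF8_OFF = b"\x1b%@"  # ESC % @: return to the default character set

# Keyboard-mode ioctl constants, copied from <linux/kd.h>.
KDSKBMODE = 0x4B45
K_XLATE = 0x01    # ordinary 8-bit translated mode
K_UNICODE = 0x03  # keyboard delivers UTF-8


def screen_sequence(enable):
    """Return the escape sequence for the screen (output) side."""
    return UTF8_ON if enable else UTF8_OFF


def set_utf8_mode(enable, out=None, tty_fd=None):
    """Switch the screen side, and optionally the keyboard side.

    out    -- binary stream connected to the terminal (default: stdout)
    tty_fd -- file descriptor of a Linux console; if None, the
              keyboard ioctl() is skipped (e.g. inside xterm).
    """
    out = out if out is not None else sys.stdout.buffer
    out.write(screen_sequence(enable))
    out.flush()
    if tty_fd is not None:
        fcntl.ioctl(tty_fd, KDSKBMODE, K_UNICODE if enable else K_XLATE)
```

Inside a terminal emulator only the escape sequence is needed, which is
why tty_fd defaults to None here.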

> Users: Bruno or somebody (forgive me, I have no memory) told what the
> user does:
> 	setenv LC_CTYPE utf8
> Now user programs know that the data they handle is utf8 encoded.

Yes. Either because they check themselves whether LC_CTYPE matches
".*[uU][tT][fF]-?8.*", or because they use wchar_t throughout internally
and let the C library worry about any external character encoding issues
via wprintf(), wscanf(), etc.
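
The first variant, where the program checks the locale name itself,
might look something like the sketch below. The function names
is_utf8_locale() and effective_ctype() are made up for this example.
The lookup order LC_ALL, LC_CTYPE, LANG is the usual POSIX precedence
and is an addition of mine here; the pattern is the one given above,
minus the redundant leading and trailing ".*".

```python
import os
import re

# Same test as ".*[uU][tT][fF]-?8.*": an unanchored search for the core.
UTF8_PATTERN = re.compile(r"[uU][tT][fF]-?8")


def is_utf8_locale(value):
    """True if a locale name such as 'en_GB.UTF-8' indicates UTF-8."""
    return bool(value) and UTF8_PATTERN.search(value) is not None


def effective_ctype(environ=os.environ):
    """Return the locale name that governs LC_CTYPE.

    Assumes the usual POSIX precedence: LC_ALL overrides LC_CTYPE,
    which overrides LANG; empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    print(is_utf8_locale(effective_ctype()))
```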

> Does this also hold for filenames?

Yes, of course.

> Maybe no for vfat or joliet - I understand these come with
> character set information, but have not looked at the details.

What should happen is that mount checks LC_CTYPE and passes this
information down to the vfat driver. I don't think this has been
implemented yet.

My big vision is that /etc/profile just has to contain the line

export LC_CTYPE=UTF-8

and suddenly my system behaves on all levels like Plan 9, i.e. ISO 8859-1
is replaced absolutely everywhere with UTF-8. UTF-8 is then used in
filenames, environment variables, config files, C source code (relevant
for the interpretation of L"..." strings), as the multi-byte encoding of
the C library, in standard input/output, etc. The mount command passes
it down to the foreign file system drivers, and stty passes it down to
the ttys (where it might resurface on the other side in an xterm or in
the console/keyboard kernel driver). Ext2fs treats filenames only as
byte sequences and remains fully ignorant of the character encoding.
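
To make the last point concrete: the file system stores bytes, and it
is entirely up to user space which characters those bytes mean. A small
sketch, with interpret() as a hypothetical helper and "café.txt"
(written with an escape for the é) as an arbitrary example name:

```python
def interpret(raw_name, encoding):
    """Decode a raw filename the way a program using `encoding` would.

    Undecodable bytes become U+FFFD rather than raising an error.
    """
    return raw_name.decode(encoding, errors="replace")


# The byte sequence a UTF-8 system would hand to the file system
# for the name "caf\u00e9.txt".
stored = "caf\u00e9.txt".encode("utf-8")

# The same bytes, read back under two different LC_CTYPE assumptions.
as_utf8 = interpret(stored, "utf-8")         # the intended name
as_latin1 = interpret(stored, "iso-8859-1")  # two characters for the é
```

The file system never sees the difference between as_utf8 and
as_latin1; both come from the identical nine bytes in stored.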

I want to avoid needing long HOWTO documents that describe how to
convert your entire system to UTF-8. LC_CTYPE=UTF-8 seems to be
the best system-wide switch I can think of.

The only thing I don't like about locales is that they usually mix
together localization information and character encoding. However,
LC_CTYPE means explicitly that we are only interested in the character
encoding aspect of the locale (LANG is the whole thing), and therefore
LC_CTYPE=UTF-8, LC_CTYPE=en_GB.UTF-8, LC_CTYPE=de_DE.UTF-8, etc. should
really all be exactly the same.
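
As a sketch of what "exactly the same" could mean in practice, a
program might throw away everything except the codeset part of the
name. codeset_of() is a hypothetical helper; it assumes the common
language[_territory][.codeset][@modifier] layout, and accepting a bare
"UTF-8" as a codeset is the convention proposed here, not something
existing C libraries are known to support.

```python
def _normalise(codeset):
    """Fold case and drop '-' and '_' so 'UTF-8' and 'utf8' compare equal."""
    return codeset.lower().replace("-", "").replace("_", "")


def codeset_of(locale_name, known=("utf8",)):
    """Return the normalised codeset of a locale name, or None.

    A name without a '.' is only treated as a codeset if it is listed
    in `known` (assumption: this is how a bare "UTF-8" is recognised).
    """
    name = locale_name.split("@", 1)[0]  # drop any @modifier
    if "." in name:
        return _normalise(name.split(".", 1)[1])
    bare = _normalise(name)
    return bare if bare in known else None
```

With this, "UTF-8", "en_GB.UTF-8" and "de_DE.UTF-8" all map to the same
value, while "de_DE" alone yields None.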

Can telnet pass along the information that we prefer UTF-8? Or is telnet
already obsolete? (It was forbidden everywhere I worked over the last
3 years, as the world has apparently now moved almost completely to ssh
because of password sniffers.)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/