Re: Testing for UTF-8 tty mode
Andries.Brouwer@cwi.nl wrote on 1999-09-16 00:42 UTC:
> My big vision is that /etc/profile just has to contain the line
>
> export LC_CTYPE=UTF-8
>
> and suddenly my system behaves on all levels like Plan9, i.e. ISO 8859-1
> is replaced absolutely everywhere with UTF-8. UTF-8 is used in
> filenames, environment variables, config files, C source code (relevant
> for the interpretation of L"..." strings), as the multi-byte encoding by
> the C library, in standard input/output, etc., mount passes it down to
> the foreign file system drivers, stty passes it down to the ttys (where
> it might resurface on the other side in an xterm or in the console/
> keyboard kernel driver), etc. Ext2fs treats filenames only as byte
> sequences and remains fully ignorant of the character encoding.
>
> A good plan.
> (But note that various other filesystems have built-in ideas
> about the character set of the filenames.)
Note that I did already specifically write above: "... mount passes it
down to the foreign file system drivers ...". Yes, I do agree that some
file systems have a specified character encoding and in this case we
have to convert in the kernel. Indeed, I hope that we can one day also
say that Ext2fs shall only use UTF-8 as its official encoding for file
names. By that time, UTF-8 should be the only encoding used on GNU
systems, such that Ext2fs will never need any conversion code.
> Probably Bruno's IUTF8 bit can replace the ioctl for the keyboard.
> That would reduce us to two again. That is reasonable enough -
> one should be able to control input and output separately.
Neither xterm nor Kermit can activate a UTF-8 mode separately for
screen and keyboard, and apart from you, I have never encountered anyone
who had an idea about why this would be desirable. It just adds to the
state space and is just another potential configuration hazard.
Markus' law, version 0 (a corollary of Murphy's law):
The more things can be configured, the more things can be configured
wrongly. Additional configuration options are a fundamentally bad
thing and they have to be carefully justified each time another
one is added. The fewer buttons there are on your gadget, the better.
Introducing Unicode has the goal of getting rid of one major
configuration option in the long term: there will one day be only one
character set, so you can no longer accidentally run things using the
wrong one.
Golden rule: When you add Unicode support to a system, please make sure
to keep the configuration options down to the absolutely necessary
minimum. Ideally, there should just be one big switch that toggles
between Unicode on one side and old-mess on the other side. If you don't
know which option is better, let's first discuss here the pros and cons
before making it configurable.
> Now the IUTF8 bit has properties very different from those of the ioctl.
> The ioctl sets properties of the keyboard driver, while the bit
> is set by some application programs and not by others.
> Thus, the keyboard driver cannot do the conversion before
> it is known who will read the data; that is, it must produce
> 16-bit values as found in the keymap, and leave it to the tty driver
> to decide whether conversion to UTF8 is desired.
Can't the tty driver just switch both the keyboard and the console
to UTF-8 mode when the IUTF8 bit is set? This way, an "stty iutf8"
is all we need to set the console to UTF-8 mode.
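
As a rough illustration of what "stty iutf8" would amount to when done
from a program, here is a minimal sketch. It assumes that termios exposes
an IUTF8 flag in c_iflag; the function name tty_set_iutf8 is made up for
this example, and the #ifdef covers systems that do not define the flag.

```c
#include <assert.h>
#include <errno.h>
#include <termios.h>

/* Hypothetical helper: set or clear the IUTF8 input flag on the tty
 * open on fd. Returns 0 on success, -1 on failure (errno is set). */
int tty_set_iutf8(int fd, int enable)
{
#ifdef IUTF8
    struct termios tio;

    assert(enable == 0 || enable == 1);
    if (tcgetattr(fd, &tio) == -1)
        return -1;                      /* not a tty, or bad descriptor */
    if (enable)
        tio.c_iflag |= IUTF8;
    else
        tio.c_iflag &= ~(tcflag_t)IUTF8;
    return tcsetattr(fd, TCSANOW, &tio);
#else
    /* Assumption: without the flag there is nothing to toggle. */
    (void)fd;
    (void)enable;
    errno = ENOSYS;
    return -1;
#endif
}
```

Calling tty_set_iutf8(0, 1) on a terminal would then correspond to the
single switch argued for here; what the console and keyboard drivers do
in response is outside this sketch.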
> However, there are technical difficulties, since the tty driver
> expects a byte stream. So, perhaps the keyboard driver should
> always produce UTF8, a byte stream, and the tty driver should,
> if the IUTF8 bit is not set, convert this back (and hope that
> conversion back yields an 8-bit character).
I see no need for the tty driver to engage in any character set
conversion. It should just forward the information that the IUTF8 bit
has been set or cleared to both the console and the keyboard driver,
which then activate or deactivate their respective UTF-8 engines
accordingly.
> Note that UTF8 is used here in the proper meaning of the word:
> transformation format, without implication that Unicode is
> involved.
Please, please don't spread such terminological confusion. UTF
officially means "UCS transformation format", and UCS is "Universal
Character Set", the official ISO short name for ISO 10646-1 = Unicode.
You can also remember UCS as "Unicode Character Set" if that helps. In
any case, the U stands specifically for ISO 10646-1 or Unicode.
Please, never ever think of "UTF-8" as a means to transport anything but
pure Unicode values! If you use it in this sense, please give it a
different name to avoid confusion. Call it "Andries' Transformation
Format". UTF-8 is a means to transport Unicode values and should not be
used to refer to a general 16-bit encapsulation mechanism that could
also transport JIS X0208 etc. In this respect, it is not comparable with
say EUC.
> For example, a user uses ISO-8859-2 and not UTF8
> and has a keymap showing ISO-8859-2 values - single bytes,
> unrelated to Unicode. The keyboard driver uses the transformation
> format to encode these as bytes or byte pairs (not knowing anything
> about the character set the keymap is supposed to be in) and feeds
> this to the tty driver, who converts back to the single ISO-8859-2
> bytes. Awkward, but I suppose in this way the code will become
> simplest.
I would prefer it if the keymaps loaded into the keyboard driver all
had three columns: a key code, an 8-bit character, and a Unicode
character. Depending on whether the UTF-8 mode is active or not, the
keyboard driver switches between the 8-bit and the Unicode table. This
way, you can load your keyboard with an ISO 8859-2 keymap and still get
guaranteed correct Unicode values out when you activate the UTF-8 mode.
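
A small sketch of such a three-column keymap, to make the idea concrete.
The struct layout, the function names and the key codes are all invented
for this example; only the ISO 8859-2 and Unicode values are real. The
encoder is limited to values up to U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One keymap row: key code, 8-bit character, Unicode character. */
struct keymap_entry {
    uint8_t  keycode;   /* hypothetical key code */
    uint8_t  ch8;       /* value in the 8-bit set (here ISO 8859-2) */
    uint32_t ucs;       /* corresponding Unicode value */
};

static const struct keymap_entry keymap[] = {
    { 0x1e, 0x61, 0x0061 },   /* a */
    { 0x28, 0xB1, 0x0105 },   /* a with ogonek */
    { 0x29, 0xA3, 0x0141 },   /* L with stroke */
};

/* Encode one Unicode value as UTF-8; returns the number of bytes. */
static size_t utf8_encode(uint32_t ucs, unsigned char *out)
{
    assert(out != NULL);
    if (ucs < 0x80) {
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;                 /* out of range for this sketch */
}

/* Translate a key code into the bytes the keyboard driver would emit.
 * out must have room for 4 bytes. Returns 0 for an unknown key code. */
size_t key_to_bytes(uint8_t keycode, int utf8_mode, unsigned char *out)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < sizeof keymap / sizeof keymap[0]; i++) {
        if (keymap[i].keycode != keycode)
            continue;
        if (utf8_mode)
            return utf8_encode(keymap[i].ucs, out);   /* Unicode column */
        out[0] = keymap[i].ch8;                       /* 8-bit column */
        return 1;
    }
    return 0;
}
```

The same key thus yields the single byte 0xB1 in 8-bit mode and the
UTF-8 sequence for U+0105 in UTF-8 mode, without the tty driver having
to convert anything.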
The tty driver should not touch the character encoding at all. It only
has to know about UTF-8 vs. 8-bit for handling the editing functions
correctly.
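
To illustrate the one place where the tty driver does need to know the
mode, here is a minimal sketch of the erase function on a line buffer.
The function name and the buffer interface are assumptions made for this
example, not actual driver code. In UTF-8 mode, erasing one character
means also dropping the continuation bytes (10xxxxxx) that belong to it.

```c
#include <assert.h>
#include <stddef.h>

/* Return the new length of the line buffer after erasing its last
 * character. In 8-bit mode that is always exactly one byte; in UTF-8
 * mode the trailing continuation bytes are removed together with the
 * lead byte in front of them. */
size_t erase_last_char(const unsigned char *buf, size_t len, int utf8_mode)
{
    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0;                 /* nothing to erase */
    len--;                        /* drop the final byte */
    if (utf8_mode) {
        /* Step back over continuation bytes to the lead byte. */
        while (len > 0 && (buf[len] & 0xC0) == 0x80)
            len--;
    }
    return len;
}
```

With the three bytes 'a', 0xC4, 0x85 in the buffer, the UTF-8 mode
erases two bytes and leaves the 'a', whereas the 8-bit mode would erase
only 0x85 and leave a broken sequence behind.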
In a sense, we already have the same for the console. Every glyph (for
the keyboard: key) is assigned both an 8-bit (or 9-bit) code and one (or
several) Unicode values. So there exists both an 8-bit and a Unicode
view of the glyphs in the console; likewise, we can have both an 8-bit
and a Unicode view of the keys in the keyboard driver.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/