
Re: UTF-8 tty mode



Markus Kuhn wrote:

: My big vision is that /etc/profile just has to contain the line
: 
: export LC_CTYPE=UTF-8
: 
: and suddenly my system behaves on all levels like Plan9, i.e. ISO 8859-1
: is replaced absolutely everywhere with UTF-8.
: ...
: I want to avoid that we need long HOWTO documents that describe how to
: get your entire system converted to UTF-8. LC_CTYPE=UTF-8 seems to be
: the best system-wide switch I can think of.

This way, UTF-8 configuration would stand out positively against almost 
every other feature that can be installed with Unix/Linux.
I hope your proposal succeeds.

From: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
: Can't the tty driver just switch both the keyboard and the console
: to UTF-8 mode when the IUTF8 bit is set? This way, an "stty iutf8"
: is all we need to set the console to UTF-8 mode.
One way or the other...
Which configuration mechanism (environment variable, stty, ...) would have the 
best chance of passing transparently through rlogin and telnet connections?
I think this should be taken into account.


From: Andries.Brouwer@cwi.nl
: ...
: However, there are technical difficulties, since the tty driver
: expects a byte stream. So, perhaps the keyboard driver should
: always produce UTF8, a byte stream, and the tty driver should,
: if the IUTF8 bit is not set, convert this back (and hope that
: conversion back yields an 8-bit character).
I can imagine there is a good technical reason for the distinction 
between the tty driver and the keyboard/screen driver in Unix/Linux. From the 
point of view of the user, and even of the application (non-system) programmer, 
it remains a bit artificial, though.
I think it is definitely a MUST for any configuration solution discussed 
here that the user does not have to care about this structure.


-------------------------------------------------------

: The only thing I don't like about locales is that they usually mix
: together localization information and character encoding. However,
Indeed, I have just hit this problem myself.

: LC_CTYPE means explicitly that we are only interested in the character
: encoding aspect of the locale (LANG is the whole thing), and therefore
: LC_CTYPE=UTF-8, LC_CTYPE=en_GB.UTF-8, LC_CTYPE=de_DE.UTF-8, etc. should
: really all be exactly the same.
I wish I could tell today's software (e.g. applications that 
unfortunately use curses...) simply to behave 8-bit-clean, without 
being language-specific (i.e. without having to set something like 
de_DE; after all, American users may occasionally want to handle 8-bit I/O 
as well) and without needing a locale database 
installed on the system (which often just isn't installed because of 
the many HOWTOs that the system administrator would have had to read...).
Why isn't such a simple thing possible nowadays?
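
To illustrate the quoted point that LC_CTYPE=UTF-8, en_GB.UTF-8 and 
de_DE.UTF-8 should all mean the same thing as far as the encoding is 
concerned, here is a minimal Python sketch that looks only at the codeset 
part of a locale name. The function name codeset_of and the "ASCII" 
fallback are assumptions made for this example; a bare "UTF-8" value is the 
proposal discussed in this thread, not something existing locale systems 
necessarily accept.

```python
def codeset_of(locale_name: str, default: str = "ASCII") -> str:
    """Return a normalized codeset for a locale name (illustrative only).

    Handles the common "language_TERRITORY.codeset@modifier" shape and the
    bare-codeset form proposed in this thread. A language-only name without
    territory (e.g. "de") is NOT recognized and would be taken as a codeset.
    """
    # Drop an optional modifier such as "@euro".
    name = locale_name.split("@", 1)[0]
    if "." in name:
        # Everything after the first dot is the codeset.
        codeset = name.split(".", 1)[1]
    elif "_" in name or name in ("", "C", "POSIX"):
        # Language/territory only, or the default locale: no codeset given.
        codeset = default
    else:
        # Bare codeset, e.g. "UTF-8" (hypothetical, as proposed above).
        codeset = name
    # Normalize spelling so that "UTF-8", "utf8" and "Utf-8" compare equal.
    return codeset.upper().replace("-", "")


if __name__ == "__main__":
    for value in ("UTF-8", "en_GB.UTF-8", "de_DE.utf8", "de_DE"):
        print(value, "->", codeset_of(value))
```

With this kind of check, a program could decide how to treat bytes without 
consulting any language-specific part of the locale.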


-------------------------------------------------------

: Can telnet pass the information that we prefer UTF-8 along? Or is telnet
: already obsolete (it was forbidden everywhere where I worked over the
: last 3 years, as the world has now moved almost completely to ssh
: apparently due to password sniffers.)
I have already wondered why Linux telnet is still so far behind the times 
(it still has to be used in many environments): it's not 8-bit-clean by 
default, you have to pass it "-8", and then it stupidly also switches 
off some reasonable CR-LF handling, so that you may have to terminate 
login and password with ^J instead of RETURN, and screen output doesn't wrap...
Will this eventually be fixed by someone?


-------------------------------------------------------

From: Bram Moolenaar <Bram@moolenaar.net>

: ...
: That doesn't solve the problem of having different types of files on my
: harddisk.  This depends on the specific file, not on the environment.
For a transitional period, it might be useful to take a liberal approach 
to file name display, or even to the interpretation of received text: 
almost every string that contains non-ASCII Latin-1 characters is illegal as 
UTF-8, so a simple heuristic recognition would be possible.
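
A rough Python sketch of such a heuristic, under the assumption that only 
ASCII, UTF-8 and Latin-1 need to be told apart: try a strict UTF-8 decode 
and fall back to Latin-1 if it fails. The function names guess_encoding and 
display_name are made up for this example. Note that the heuristic can be 
fooled by the rare Latin-1 string that happens to form valid UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'latin-1' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"          # pure 7-bit data is valid in all three
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")    # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"        # every byte sequence is valid Latin-1


def display_name(raw: bytes) -> str:
    """Decode a file name for display using the guessed encoding."""
    if guess_encoding(raw) == "latin-1":
        return raw.decode("latin-1")
    return raw.decode("utf-8")


if __name__ == "__main__":
    word = "Gr\u00fc\u00dfe"
    for raw in (b"readme.txt", word.encode("utf-8"), word.encode("latin-1")):
        print(raw, "->", guess_encoding(raw), "->", display_name(raw))
```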


-------------------------------------------------------

: For an editor, you might want to switch dynamically between different modes,
: depending on the type of file being edited at the time.  This becomes really
: "interesting" when using split-windows...  The solution would probably be to
: convert a file when it's read in, and convert it back when written out.
: LC_CTYPE could then specify the internal format that the editor uses, but not
: the format of the file itself, which could be anything.
My editor mined has auto-recognition and handles both Latin-1 and UTF-8 
encoded files.
The internal format is that of the original file (except for UTF-16, which 
is converted to UTF-8).
The interpretation can, however, be switched while editing. UTF-8 interpretation 
accepts illegal UTF-8 sequences transparently and does not destroy their 
information.
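
The idea of accepting illegal UTF-8 sequences without losing them can be 
sketched in Python with the standard "surrogateescape" error handler. This 
is only an illustration of the principle, not how mined itself is 
implemented; the function names load_lossless and save_lossless are made up 
for this example.

```python
def load_lossless(raw: bytes) -> str:
    """Decode as UTF-8, keeping each illegal byte as a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def save_lossless(text: str) -> bytes:
    """Encode back to bytes, restoring the original illegal bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    # Valid UTF-8 "a-umlaut", followed by a stray Latin-1 byte 0xE4.
    mixed = "\u00e4".encode("utf-8") + b"\xe4" + b" ok"
    text = load_lossless(mixed)
    # The round trip reproduces the input byte for byte.
    print(save_lossless(text) == mixed)
```

Because each illegal byte is kept as a distinct placeholder character, 
writing the text back yields the original bytes unchanged.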


-------------------------------------------------------

Thanks, Markus, for pointing me to this mailing list.

Thomas Wolff
towo@computer.org
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/