
Re: Testing for UTF-8 tty mode



Bram Moolenaar wrote on 1999-09-16 10:17 UTC:
> Markus Kuhn wrote:
> > I have also never been able to imagine an application (not even
> > for debugging really), where it comes in handy to have the keyboard but
> > not the screen in UTF-8 mode, or vice versa.
> 
> It would be useful to display UTF8 without the need to type it.  For example
> for less.  Also for editors, since there are other methods to enter special
> characters (e.g., digraphs).

It doesn't do any harm if the keyboard is in UTF-8 mode for these
applications, on the contrary: "less" is entirely keyboard controlled by
ASCII characters, and ASCII characters are encoded in UTF-8 also as
ASCII characters. ISO 8859-* and UTF-8 are identical for the
0x0000-0x007f range, and this is the range that contains all "less"
commands.
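
To illustrate the point, here is a small Python sketch (not part of the
original exchange; the key list is only an assumed sample of single-key
"less" commands). It checks that ASCII keys produce the same bytes in
ISO 8859-1 and UTF-8, while a non-ASCII character does not:

```python
# Assumed sample of single-key "less" commands; all are plain ASCII.
less_keys = ["q", "/", "n", "N", "g", "G", " "]

# ASCII characters encode to the same single byte in both encodings.
ascii_identical = all(
    key.encode("utf-8") == key.encode("iso-8859-1") for key in less_keys
)

# A non-ASCII character (a with diaeresis, U+00E4) differs between the two.
latin1_bytes = "\u00e4".encode("iso-8859-1")  # one byte
utf8_bytes = "\u00e4".encode("utf-8")         # two bytes

print(ascii_identical, latin1_bytes, utf8_bytes)
```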

However as soon as you want to enter a regular expression into less, you
really want to have the keyboard also in UTF-8 mode, because otherwise
you couldn't enter non-ASCII characters. It makes absolutely no sense to
have the screen in UTF-8 mode but the keyboard not if you use less,
because string search would become unusable for non-ASCII characters
otherwise.
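
A minimal Python sketch of why the search breaks, assuming the file is
stored in UTF-8 and the search is a plain byte comparison (the sample
text and variable names are made up for illustration):

```python
# Text as stored in a UTF-8 file and shown on a UTF-8 screen.
file_bytes = "Gr\u00fc\u00dfe aus M\u00fcnchen\n".encode("utf-8")

# The same search pattern as the keyboard would send it in each mode.
pattern = "M\u00fcnchen"
typed_latin1 = pattern.encode("iso-8859-1")  # keyboard not in UTF-8 mode
typed_utf8 = pattern.encode("utf-8")         # keyboard in UTF-8 mode

# A byte-wise search only succeeds when both sides use the same encoding.
found_latin1 = typed_latin1 in file_bytes
found_utf8 = typed_utf8 in file_bytes

print(found_latin1, found_utf8)
```

Only the pattern typed in UTF-8 mode matches; the ISO 8859-1 bytes for
the same characters never occur in the file.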

Therefore, whenever you have the screen in UTF-8 mode, you also want the
keyboard to be in the same mode.

A terminal is a bidirectional communications channel. Features like
character echoing tie both directions semantically very closely
together, so they obviously should always have the same encoding.

> That doesn't solve the problem of having different types of files on my
> harddisk.

The big vision is *not* to have plaintext files in different
encodings on the harddisk. The day you switch your system to UTF-8, you
run a big

  find . -type f -exec recode latin1..utf8 {} \;

over your entire harddisk, and from then on, everything is in UTF-8. You
set LC_CTYPE=UTF-8 to tell every application that everything is in UTF-8
now. Applications that process files received from the outside world
(e.g., email readers and web software) will see LC_CTYPE=UTF-8 and will
convert received MIME "text/* ; charset=xyz" files into UTF-8 before
saving them on the harddisk. This way, non-UTF-8 files never get onto
your system again. If one does (e.g., from a floppy disc), use iconv,
recode, etc. to fix it manually, just like you have to fix it manually
today if you read an MS-DOS CP437 file from a floppy.
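
One caveat with a blanket conversion is that running it twice, or over
files that are already UTF-8, would double-encode them. The following
Python sketch shows a more cautious variant under some assumptions: the
function name convert_tree_to_utf8 is hypothetical, every file is
assumed to be plain text, and anything that is not valid UTF-8 is
assumed to be ISO 8859-1. It skips files that already decode as UTF-8
(which includes pure ASCII):

```python
import os
import tempfile


def convert_tree_to_utf8(root, source_encoding="iso-8859-1"):
    """Re-encode files under root that are not already valid UTF-8.

    Hypothetical helper: assumes all files are text in source_encoding.
    Returns the sorted names of the files that were rewritten.
    """
    converted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode("utf-8")
                continue  # pure ASCII or already UTF-8: leave untouched
            except UnicodeDecodeError:
                pass
            with open(path, "wb") as f:
                f.write(data.decode(source_encoding).encode("utf-8"))
            converted.append(name)
    return sorted(converted)


# Demo on a throwaway directory with one file of each kind.
with tempfile.TemporaryDirectory() as root:
    samples = {
        "ascii.txt": b"plain ASCII\n",
        "latin1.txt": b"na\xefve\n",
        "utf8.txt": "na\u00efve\n".encode("utf-8"),
    }
    for name, data in samples.items():
        with open(os.path.join(root, name), "wb") as f:
            f.write(data)

    first_run = convert_tree_to_utf8(root)
    second_run = convert_tree_to_utf8(root)  # nothing left to convert
    with open(os.path.join(root, "latin1.txt"), "rb") as f:
        latin1_after = f.read()

print(first_run, second_run)
```

The validity check is only a heuristic: some ISO 8859-1 byte sequences
happen to be valid UTF-8 as well, and binary files would need to be
excluded by other means.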

You will spot non-UTF-8 files quickly, because they look funny in your
UTF-8 terminal emulator. You will therefore convert them to UTF-8 as
soon as you spot one and the problem will be fixed, just as we
eradicate MS-DOS CP437 files on our Linux partitions quickly by
converting them to the system-wide encoding. UTF-8 has the big advantage
that no information is lost during these conversions, because Unicode is
a superset of all other commonly used encodings.
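
The superset claim can be checked for one concrete case with a short
Python sketch (ISO 8859-1 is used only as an example; the variable
names are illustrative): every possible ISO 8859-1 byte survives a
round trip through UTF-8, whereas the reverse direction can fail.

```python
# Every byte value is a defined character in ISO 8859-1.
all_latin1_bytes = bytes(range(256))

# ISO 8859-1 -> UTF-8 -> ISO 8859-1 must give back the original bytes.
as_utf8 = all_latin1_bytes.decode("iso-8859-1").encode("utf-8")
back = as_utf8.decode("utf-8").encode("iso-8859-1")
lossless_latin1 = back == all_latin1_bytes

# The other direction is not always possible: the euro sign (U+20AC)
# exists in Unicode but has no code position in ISO 8859-1.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(lossless_latin1, euro_fits_latin1)
```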

> This depends on the specific file, not on the environment.  The
> only solution for this is that the file itself specifies its encoding.

No, the idea is that there exists a global system encoding and that all
files are converted into this encoding. UTF-8 is ASCII compatible, so
pure ASCII files will not have to be touched at all. We definitely do
not want to carry the MIME character-set-tag mess over into the Linux
file system! We want to have only one single encoding to *avoid* having
to tag a character set. This keeps everything neat and very simple.

> For an editor, you might want to switch dynamically between different modes,
> depending on the type of file being edited at the time.

In the end, you want to have your editor always in UTF-8 mode, because
all your files will be in UTF-8. Just like the Plan9 folks have done
it for half a decade already.

> This becomes really
> "interesting" when using split-windows...  The solution would probably be to
> convert a file when it's read in, and convert it back when written out.
> LC_CTYPE could then specify the internal format that the editor uses, but not
> the format of the file itself, which could be anything.

No, LC_CTYPE specifies the format of the file itself all the time.
LC_CTYPE specifies the system-wide character encoding for all files,
filenames, etc.

You see what horrible trouble you get into if you suddenly try to
introduce the notion of file types into Unix. If you had multiple
encodings simultaneously in one system, the result of a "cat
iso8859-1.txt iso8859-5.txt utf-8.txt" would not be processable by any
software. Do you want to have to add a recoding functionality to
cat-like applications? Certainly not.
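
A small Python sketch of the "cat" problem (the three sample strings
are made up; byte concatenation stands in for what cat does): the
combined byte stream cannot be decoded correctly under any one of the
three encodings.

```python
# Three "files", each in a different encoding.
part_latin1 = "caf\u00e9\n".encode("iso-8859-1")
part_cyrillic = "\u043c\u0438\u0440\n".encode("iso-8859-5")
part_utf8 = "\u20ac\n".encode("utf-8")

# What the reader of the combined output is supposed to see.
expected = "caf\u00e9\n\u043c\u0438\u0440\n\u20ac\n"

# "cat" simply concatenates the bytes.
concatenated = part_latin1 + part_cyrillic + part_utf8

# Try to interpret the result under each single encoding in turn.
decodes_correctly = {}
for encoding in ("iso-8859-1", "iso-8859-5", "utf-8"):
    decoded = concatenated.decode(encoding, errors="replace")
    decodes_correctly[encoding] = decoded == expected

print(decodes_correctly)
```

Whichever encoding the consumer picks, at least two of the three parts
come out wrong.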

> And what if I have one file system with (say, from Windows NT) and one without
> (say from OS/2) that type?  This probably requires specifying the type to
> mount itself.  LC_CTYPE could be used as the default though.

For Windows NT, we have the official definition from Microsoft that its
filenames are always in Unicode, so no problem here. I don't know about
OS/2 files, but I would assume that the user of an OS/2 system has (like
under DOS) to agree on one system-wide encoding used in all OS/2
filenames (typically CP850, I'd expect), and we then have to tell the
OS/2 FS mount command under Linux manually in what encoding the file
names on the OS/2 partition should be interpreted. LC_CTYPE would tell
the mount command in what encoding the filenames should be
presented to Linux user processes such as "ls". Most of the code is
actually already there, the only thing missing I think is to add UTF-8
as yet another encoding to the OS/2 driver, and to make mount read and
interpret LC_CTYPE.
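
The translation step described here might look roughly like the
following Python sketch. The function present_filename and its
parameters are hypothetical and only model the idea; a real filesystem
driver would do this in the kernel. The on-disk encoding comes from the
mount option, the presentation encoding from LC_CTYPE:

```python
def present_filename(raw_name: bytes, fs_encoding: str,
                     locale_encoding: str) -> bytes:
    """Translate an on-disk filename into the encoding user processes expect.

    Hypothetical helper: fs_encoding models the mount option,
    locale_encoding models the encoding selected by LC_CTYPE.
    """
    return raw_name.decode(fs_encoding).encode(locale_encoding)


# The name "über.txt" as stored on a CP850 partition (0x81 is u-diaeresis).
on_disk = b"\x81ber.txt"

# What a process such as "ls" would see on a UTF-8 system.
shown = present_filename(on_disk, "cp850", "utf-8")

print(shown)
```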

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/