Re: UTF-8 keyboard mode



Bram Moolenaar wrote on 1999-09-16 20:38 UTC:
> 
> Markus Kuhn wrote:
> > It doesn't do any harm if the keyboard is in UTF-8 mode for these
> > applications, on the contrary: "less" is entirely keyboard controlled by
> > ASCII characters, and ASCII characters are encoded in UTF-8 also as
> > ASCII characters. ISO 8859-* and UTF-8 are identical for the
> > 0x0000-0x007f range, and this is the range that contains all "less"
> > commands.
> 
> Hmm, I would expect switching the keyboard to UTF8 mode to take away some of
> the "normal" keys, to allow entering "special" characters.  How else would you
> be able to enter more or different characters with the same keyboard?  Or, in
> other words, if the non-UTF8 mode is fully included in the UTF8 mode, why
> would we ever want to use the non-UTF8 mode?

Ah, I think there are some fundamental misunderstandings here about what
"UTF-8 keyboard mode" means. Let's do this by example:


    user action           ISO8859-1 mode output        UTF-8 mode output

      space                       20                         20
      A                           41                         41
      B                           42                         42
      return                      0d                         0d
      Ä (umlaut A)                c4                         c3 84
      euro                        --                         e2 82 ac


As long as you press an ASCII key, you will get exactly the same result
in UTF-8 mode as in ISO 8859-1 mode. If you press a non-ASCII key
such as Ä, you will get a UTF-8 code that differs from ISO 8859-1, and
this code will be two bytes long. If you press AltGr-E to get the euro
that is now present on all new European PC keyboards but not in ISO
8859-1, then nothing will happen under the old ISO 8859-1 mode, but you
will get the correct 3-byte UTF-8 sequence in UTF-8 mode. New keys will
become available in the UTF-8 mode, other keys will produce the same
character, but encoded in UTF-8.
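
The table above can be reproduced with a few lines of Python. This is
only an illustrative sketch: the helper name key_bytes is made up here,
and it models "what the keyboard sends" simply as the encoded form of
the character, which is an assumption about the driver.

```python
def key_bytes(char, encoding):
    """Return the hex bytes for `char`, or '--' if the encoding lacks it."""
    try:
        return char.encode(encoding).hex(" ")
    except UnicodeEncodeError:
        # The character does not exist in this encoding (e.g. euro in Latin-1).
        return "--"

# space, A, B, return, A-umlaut, euro
for ch in [" ", "A", "B", "\r", "\u00c4", "\u20ac"]:
    print(repr(ch), key_bytes(ch, "iso-8859-1"), key_bytes(ch, "utf-8"))
```

The ASCII rows come out identical in both columns; only the last two
rows differ.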

In any reasonable setup, you expect that the following session always
works flawlessly:

$ cat >test
Schöne Grüße
$ cat test
Schöne Grüße

The exact same thing should happen here, independent of whether your
terminal is in Latin-1 or UTF-8 mode. Only if you look into the
"test" file will you see that a different encoding was used. Namely:

If we were in UTF-8 mode:

$ od -t xC test
0000000 53 63 68 c3 b6 6e 65 20 47 72 c3 bc c3 9f 65 0a
0000020

If we were in ISO 8859-1 mode:

$ od -t xC test
0000000 53 63 68 f6 6e 65 20 47 72 fc df 65 0a
0000015

You will not lose the function of any key in UTF-8 mode. The ASCII keys
will result in the exact same bytes, the keys that used to produce ISO
8859-1 GR characters (such as Ä) will produce the UTF-8 code for the
same character, and additional key combinations that had no meaning
before will now produce useful non-ISO8859-1 characters, for instance
AltGr-E will produce the UTF-8 sequence for the euro symbol.

If you want, you can add a universal hex-entry method to the
keyboard driver. For example ISO 14755 

  http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf

suggests that entering hex digits while Shift-Ctrl is pressed shall lead
to the corresponding Unicode character. So for instance, you can enter
a left single quotation mark (U+2018) by pressing Shift-Ctrl, then
2 0 1 8, and then releasing Shift-Ctrl. This is something the keyboard
driver has to take care of. Vim will already receive the correct UTF-8
sequence and just has to insert it into the file.

The Shift-Ctrl hex sequence has no effect in ISO 8859-1 mode, so adding
it doesn't take away any keyboard functionality.
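
What the keyboard driver would have to do with the collected hex digits
can be sketched roughly as follows. The function name hex_entry_to_utf8
is hypothetical, and the sketch ignores the key-press handling itself;
it only shows the digits-to-bytes step for the 2 0 1 8 example.

```python
def hex_entry_to_utf8(hex_digits):
    """Map the digits typed while Shift-Ctrl is held to a UTF-8 byte string."""
    code_point = int(hex_digits, 16)       # "2018" -> 0x2018
    return chr(code_point).encode("utf-8")

# Bytes the application would receive after the user releases Shift-Ctrl.
print(hex_entry_to_utf8("2018").hex(" "))  # e2 80 98
```

The application never sees the hex digits, only the finished byte
sequence.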

The only thing that you lose in UTF-8 mode is the ability to produce
ISO 8859-1 8-bit codes. If you press Ä, you will not get c4 any more.
And that's exactly what you want: if the screen is in UTF-8 mode, then
pressing the Ä key should produce exactly the code that, when sent to
the screen, results in this character being displayed.

If you have the screen driver in CP437 mode, then you would expect that
the keyboard is also in CP437 mode, right? Otherwise pressing non-ASCII
keys will lead to very strange echoes if the screen interprets CP437,
but the keyboard sends ISO 8859-1 for an Ä. Exact same thing for UTF-8.
You want

$ cat >test
Schöne Grüße
$ cat test
Schöne Grüße

always to be guaranteed to work independent of the selected encoding,
and you do not want any likely configuration options where the above
does not work because screen and keyboard are in incompatible modes.

> > The big vision is to *not* to have plaintext files in different
> > encodings on the harddisk. The day you switch your system to UTF-8, you
> > run the equivalent of a big
> > 
> >   find . -type f -exec recode latin1..utf8 {} \;
> > 
> > over your entire harddisk, and from then on, everything is in UTF-8.
> 
> Aha.  Well, this is worse than switching from a.out to elf.  Don't count on me
> switching to UTF8 for 100% within the next decade.  That command looks like it
> might mess up some data files anyway, I wouldn't dare to let it run on my
> system.

That command was of course a crude simplification. I am sure you are
intelligent and experienced enough yourself to select your files that
are currently ISO 8859-1 encoded and do contain non-ASCII characters
(most likely a small minority of the files in your harddisk) and apply a
conversion tool appropriately to them. You should not convert your email
archive, because it does contain MIME character set tags in the headers
and the email system will automatically treat it correctly.
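
A more careful version of that selection could look roughly like the
sketch below. The function name to_utf8_if_latin1 is made up for this
example, and it rests on a loud assumption: any file that is neither
pure ASCII nor already valid UTF-8 is taken to be ISO 8859-1, which is
not true for binary files or other 8-bit encodings.

```python
def to_utf8_if_latin1(data):
    """Return converted bytes, or None if the data needs no conversion."""
    if data.isascii():
        return None                 # pure ASCII is already valid UTF-8
    try:
        data.decode("utf-8")
        return None                 # already UTF-8, leave it alone
    except UnicodeDecodeError:
        # ASSUMPTION: whatever is left over is ISO 8859-1 text.
        return data.decode("iso-8859-1").encode("utf-8")

print(to_utf8_if_latin1(b"Sch\xf6ne Gr\xfc\xdfe\n"))
```

Files for which this returns None can be skipped, which should be the
large majority.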

About the "decade", may I remind you that you have been using ISO 8859-1
for less than 10 years so far? You were probably using a national ISO
646 variant or MS-DOS CP437 before that, right? X11 certainly didn't
come with ISO 8859-1 fonts back in 1990! You survived the migration from
ISO 646 or CP437 to ISO 8859-1 quite well, didn't you? Be optimistic.

> > You set LC_CTYPE=UTF-8 to tell every application that everything is in UTF-8
> > now. Applications that process files received from the outside world (e.g.,
> > email readers and web software) will see LC_CTYPE=UTF-8 and will
> > convert received MIME "text/* ; charset=xyz" files into UTF-8 before
> > saving them on the harddisk. This way, you never get again non-UTF-8
> > files onto your system. If you do (e.g., from a floppy disc), use iconv,
> > recode, etc. to fix it manually, just like you have to fix it manually
> > today if you read an MS-DOS CP437 file from a floppy.
> 
> I can imagine a lot of problems.  What if I have one application
> that doesn't support UTF8, but does use non-ASCII characters?
> (Don't answer! :-)

Many applications can remain ignorant about the character encoding.
Among the text-terminal mode applications, only those that do some form
of screen formatting have to be aware that bytes in the range 0x80-0xbf
are not characters but just continuation bytes, and therefore do not
advance the cursor. The changes are really not more complicated than the
changes we had to make throughout the first half of the 1990s to get
8-bit characters through, a task that still hasn't been 100% completed.
Nevertheless, we are pretty happy with UTF-8 now.
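
The continuation-byte rule is small enough to show in a few lines. This
is a minimal sketch with a hypothetical function name, and it assumes
that every character occupies exactly one screen column (no wide or
combining characters).

```python
def cursor_columns(data):
    """Count bytes that start a character, i.e. everything except 0x80-0xbf."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)

line = "Schöne Grüße".encode("utf-8")
# 15 bytes on the wire, but the cursor advances only 12 columns.
print(len(line), cursor_columns(line))
```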

> > You will spot non-UTF-8 file quickly, because they look funny in your
> > UTF-8 terminal emulator.
> 
> Can you define funny??

A non-ASCII ISO 8859-1 character surrounded by ASCII characters is
guaranteed to form an illegal UTF-8 sequence, which e.g. xterm shows as
an inverted question mark. So if you encounter such a file, it looks
like this in a UTF-8 environment:

$ cat test
Sch?ne Gr??e
$ recode latin1..utf8 test
$ cat test
Schöne Grüße
$

Problem solved. One more ISO 8859-1 file eradicated from my harddisk. We
did it for smallpox, we'll also win over ISO 8859-1 ... ;-)
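
The same effect can be seen without a terminal. In this sketch Python's
UTF-8 decoder stands in for the terminal emulator (an assumption; it
substitutes U+FFFD where xterm draws its own replacement glyph), and a
decode/encode pair stands in for recode.

```python
latin1_bytes = "Schöne Grüße".encode("iso-8859-1")

# Each stray Latin-1 byte is an illegal UTF-8 sequence and gets replaced.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown)

# Equivalent of "recode latin1..utf8": reinterpret, then re-encode.
fixed = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(fixed.decode("utf-8"))
```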

You are welcome to add to vim a function that pops up a message along
the lines of:

  This looks like an ISO 8859-1 file. Do you want me to
  convert to UTF-8? Y/N

It is easy to see that a file is not UTF-8, but it is difficult to
guess which 8-bit encoding it is. You might therefore want to add a
configuration option that tells vim the most likely 8-bit encoding that
it should assume when offering a conversion. A luxury function could
even select a few samples with 8-bit characters from the encountered
8-bit file, and display them in UTF-8 mode using different conversion
tables, such that the user can choose the most plausible one.
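
Such a check plus the "luxury function" might be sketched as below.
Nothing here is vim code: the function names and the list of candidate
encodings are invented for illustration, and a real editor would of
course pick the samples from the file itself.

```python
def is_utf8(data):
    """True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def candidate_readings(sample, encodings=("iso-8859-1", "iso-8859-15", "cp437")):
    """Decode one sample under several 8-bit tables for the user to compare."""
    readings = {}
    for enc in encodings:
        try:
            readings[enc] = sample.decode(enc)
        except UnicodeDecodeError:
            readings[enc] = None    # sample is not valid in this encoding
    return readings

sample = b"Sch\xf6ne Gr\xfc\xdfe"
if not is_utf8(sample):
    for enc, text in candidate_readings(sample).items():
        print(enc, text)
```

The user would then pick the line that reads most plausibly.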

> I do understand that switching completely to UTF-8 makes a nice clean system.
> I just don't see it happen for most users.

Why not? It happened for the national variants of ISO 646 as well. Just
let evolution take its path. I don't see a really big problem here. With
Unicode spreading quickly via XML, Java, WinNT, etc., there will soon be
a natural desire to have the same encoding richness also under Unix on
all levels. UTF-8 is there to provide exactly this.

> I am currently using binaries that were compiled more than five years ago.
> This switch to UTF8 probably means I have to get rid of those.

I have many binaries that can remain completely ignorant about the
difference between character encodings as long as the encoding is ASCII
compatible (which UTF-8 is).

Under Linux, it is at the moment necessary to recompile everything every
2-3 years, because of major incompatible changes. We had a.out->elf, we
had libc -> glibc, and we still grow exponentially. We are not
Microsoft. We are not slaves of a backwards compatibility dogma. We have
the source and we use this power effectively. Hanging on to ancient
binary-only installations is an evil idea that brought us Y2K,
Windows98, and other catastrophes. Linux is well beyond that. (And I
hope it remains that way.)

Seriously, the development lifecycle of free software is so fast that
adding UTF-8 really seems like a minor exercise, especially once some
critical mass has been reached and it becomes widely fashionable.

In addition, we do not force anyone to migrate at a certain point to
UTF-8. That is the idea behind using LC_CTYPE=UTF-8 as the big
system-wide switch that allows you to move within minutes your entire
system over to the new encoding. (Smooth as the start of a new
century ... ;-)

> That is not an
> attactive option...  At least with the switch from a.out to elf I was able to
> recompile the programs.  For the switch to UTF8 the sources need to be
> changed.  That is much more complicated.

For my daily use, I have only identified 14 programs that have to be
changed. Four of these do already support UTF-8, 3 more will soon.
 
> There are already many different encodings being used.  In Vim there is
> the 'fileencoding' option.
>  Possible values currently are: 
>	    ansi default setting, good for most Western languages
>	    japan	set to use shift-JIS (Windows CP 932) encoding
>	    korea	set to use Korean DBCS
>           prc		use simplified Chinese encoding
>	    taiwan	use traditional Chinese encoding
>
> I intend to add "utf8" to this, that's why I am in this group.

Great! But serious question: Are many of your users using more than one
of the vim fileencoding options simultaneously? Do you have people who
switch forward and backward between japan and korea with vim? Wouldn't
these people be the *first* ones who'd rather prefer to convert their
files entirely to Unicode to avoid having to switch encodings all the
time?

> I'm aiming at supporting a mixed environment, that is why I was asking how
> this would work.  Apparently you are aiming at a single-encoding
> environment.

These environments do not exclude each other. I am sure you realize that
a single-encoding environment is just a convenient and simple special
case of a more general mixed encoding environment.

> That won't be of help to me then.  Is this group just for making a
> single-encoding system?  In that case I better unsubscribe...

You misunderstood me completely. It is perfect if you add utf-8 just as
another file encoding option to vim. People who want to use exclusively
utf-8 can do this using vim, people who want to switch between prc and
utf-8 for some reason can also do so.

Switching between utf-8 and prc will be just as convenient or
inconvenient as switching between japan and prc. Sooner or later, users
will naturally discover that utf-8 alone is good enough for their
purposes and they will just convert all their files. Vim will be
suitable for all these scenarios. Just add it.

> OK, so the LC_CTYPE tells the OS/2 filesystem how to present characters
> towards applications.

Yes.

> Still remains the question of how to tell it what
> encoding is used in the actual file system.
> Perhaps it's fixed, like with NTFS, then it's simple.

Yes. If it is not fixed, then the character set that was selected when
OS/2 was installed has to be told to "mount" via a command-line option.
That's almost as simple.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/