Re: UTF-8 keyboard mode
Markus Kuhn wrote:
> Ah, I think there are some fundamental misunderstandings here about what
> "UTF-8 keyboard mode" means. Let's do this by example:
Thanks for the explanation. I now understand that what you call "UTF8 mode"
only involves the encoding of the resulting character. It has nothing to do
with the input mode (the keys that you hit to get the character). That's a
clean separation. Perhaps it's better to call this "UTF8 keyboard encoding",
to avoid the confusion with the input method.
I suppose that theoretically the keyboard could be in one encoding, and the
application translates it into the encoding it wants. This would require the
application to be able to detect the keyboard encoding.
However, a simple application like "cat" would not want to translate encodings
at all. It is better to have a setup where the keyboard and screen use the
same encoding. If there might be a situation where they are different, the
application should take care of it. Perhaps for a telnet session to another
system? For example, doing a session to a system that uses Latin-1 encoding.
Even then, both the keyboard and the screen encodings would need to be
translated, thus they would still be the same. I can't think of a situation
where the keyboard and screen would _need_ to be different.
> The only thing that you lose in UTF-8 mode is the ability to produce
> ISO 8859-1 8-bit codes. If you press Ä, you will not get c4 any more.
> And that's exactly what you want, because if you are in UTF-8 mode for
> the screen, you want of course that if you press the Ä key, that the
> exact same code is produced that sent to the screen will result in this
> character being displayed.
This will be a problem for applications that can't handle multi-byte
characters. The current version of Vim can't handle it. Thus if you switch
from 8859-1 to UTF8, Vim will no longer work with non-ASCII characters. I
hope this clearly shows the problem of switching completely to UTF8. I'll
work on UTF8 support in Vim, of course, but there will still be applications
that don't support UTF8 (for example, older versions of Vim that people keep
using for some reason; I noticed some people still use Vim 3.0...).
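To make the byte-level difference concrete, here is a minimal sketch. Python
is used only for illustration; it is an assumption of this example and not
something used in this thread. It shows the bytes the Ä key would produce
under each keyboard encoding, and what an application that assumes one byte
per character would make of the UTF-8 sequence.

```python
# The same key press yields different byte sequences depending on the
# keyboard encoding in effect.
char = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS (Ä)

latin1_bytes = char.encode("iso-8859-1")  # a single byte
utf8_bytes = char.encode("utf-8")         # a two-byte sequence

print(latin1_bytes.hex())  # c4
print(utf8_bytes.hex())    # c384

# An application that treats every byte as one Latin-1 character sees
# two characters instead of one when it is fed the UTF-8 sequence.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))  # 2
```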
> If you have the screen driver in CP437 mode, then you would expect that
> the keyboard is also in CP437 mode, right? Otherwise pressing non-ASCII
> keys will lead to very strange echoes if the screen interprets CP437,
> but the keyboards send ISO 8859-1 for an Ä. Exact same thing for UTF-8.
Yes, unless the application can detect the difference between the keyboard and
the screen, and translate the encodings. But, as said above, there does not
appear to be a need for this. It's only a theoretical possibility.
[about converting all files to UTF8]
> That command was of course a crude simplification. I am sure you are
> intelligent and experienced enough yourself to select your files that
> are currently ISO 8859-1 encoded and do contain non-ASCII characters
> (most likely a small minority of the files in your harddisk) and apply a
> conversion tool appropriately to them. You should not convert your email
> archive, because it does contain MIME character set tags in the headers
> and the email system will automatically treat it correctly.
I'm quite sure I am not intelligent enough! :-) I hardly ever use non-ASCII
characters myself. But they do appear in files that I got from elsewhere.
And mail messages are spread out all over my system (e.g., as part of some
downloaded tar archive). Hmm, files inside archives would also need to be
converted.
Selecting which files to convert by hand will be an enormous task. And when
unpacking a newly downloaded archive, I would have to check it again. This
does not sound like an acceptable solution for me.
Automatic detection of the encoding would help a lot, but it would have to be
reliable. Still, the idea of files being changed automatically when I, for
example, unpack an archive, sounds like a bad idea. I would probably switch
it off. I can think of problems with checksums (shar archives use them).
My conclusion for now: I will not convert all my existing files to UTF8. New
files I receive might be UTF8 encoded, older ones not. I want to use a mix of
encodings and applications that support that mix.
> About the "decade", may I remind you that you have been using ISO 8859-1
> for less than 10 years so far? You were probably using a national ISO
> 646 variant or MS-DOS CP437 before that, right? X11 certainly didn't
> come with ISO 8859-1 fonts back in 1990! You survived the migration from
> ISO 646 or CP437 to ISO 8859-1 quite well, didn't you? Be optimistic.
I never migrated. I just started finding files that didn't display properly.
Viewing MS-DOS files on Unix is a good example. The solution is to find a way
to display them properly. I never convert the file itself. I remember
working on a HP-UX system, where e-mail messages would show the name of the
company wrong (Oce, with an accented "e"). There was a discussion about how
to solve this, but it was never solved. The only solution was to view the
files from another system (Windows-NT or Solaris), or use another application
that allows switching the encoding (e.g., Netscape).
When UTF8 encoded files become more common, I would probably switch my
system to use UTF8 by default. But when finding files that are encoded
otherwise, there must still be a way to work with them. Translating them to
UTF8 is one way. But quite often I would prefer to keep the encoding
(especially when just viewing them).
> > I can imagine a lot of problems. What if I have one application
> > that doesn't support UTF8, but does use non-ASCII characters?
> > (Don't answer! :-)
>
> Many applications can remain ignorant about the character encoding.
Hey, I said Don't answer! :-)
> Among the text-terminal mode applications, only those that do some form
> of screen formatting have to be aware that bytes in the range 0x80-0xbf
> are not characters but just continuation bytes, and therefore do not
> advance the cursor. The changes are really not more complicated than the
> changes we had to make throughout the first half of the 1990s to get
> 8-bit characters through, a task that still hasn't been 100% completed.
> Nevertheless, we are pretty happy with UTF-8 now.
I disagree. Supporting the 8-bit characters was quite easy, since it only
involved fixing the code that regarded characters above 0x7f as negative
numbers (e.g., recognizing them as EOF characters). Support for multi-byte
characters is complicated. Check out the code in Vim that deals with it
(actually it's just for two-byte encoding right now). It's a lot of code.
UTF8 support will add to that. And we keep finding problems that need to be
fixed. I didn't find an 8-bit problem for years.
Even a simple application that just displays text on the screen with line
wrapping will be affected, because there is no direct relation between the
number of bytes and the space occupied by them. Wrapping lines will be
different, thus every application that uses the line length is affected.
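The gap between byte count and character count can be sketched as follows.
This is only an illustration in Python under stated assumptions: the input is
well-formed UTF-8, and the helper names is_continuation and count_characters
are made up for this example. It counts characters, not screen columns;
combining and double-width characters would need further handling, which is
part of why this is more work than it first appears.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which lie in 0x80..0xbf."""
    return 0x80 <= byte <= 0xBF


def count_characters(data: bytes) -> int:
    """Count encoded characters by skipping continuation bytes.

    Assumes data is well-formed UTF-8. The result is a character count,
    not a display width.
    """
    return sum(1 for b in data if not is_continuation(b))


line = "Oc\u00e9 K\u00f6ln".encode("utf-8")
print(len(line))               # 10 (bytes)
print(count_characters(line))  # 8 (characters)
```

A line-wrapping routine that uses len(line) here would wrap two columns too
early.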
> You are welcome to add to vim a function that pops up a message along
> the lines of:
>
> This looks like an ISO 8859-1 file. Do you want me to
> convert to UTF-8? Y/N
Well, it's possible, but not nice. First of all, users don't like questions
that pop up unpredictably. Second, the user might want to just view the file
and not convert it.
How reliable is the check for the UTF8 file encoding? How much of the file
would need to be read to detect it? What if the file is only a few characters
long? Perhaps you can refer me to research that has been done in this area.
For Vim I would ideally detect the file encoding automatically, like it is now
done for detecting the end-of-line character. However, the end-of-line has
only three possibilities (Unix, DOS and Mac), while for the file encoding I already
have half a dozen. If the detection relies on finding a byte sequence that is
illegal in all but one encoding, it will probably not be 100% reliable. It
would be much better if something in the file specifies the encoding.
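A rough sketch of what such a check could look like, in Python, assuming the
whole content is available as bytes. The function name classify and its three
result labels are invented for this example. It relies on the fact that many
byte sequences are illegal in UTF-8, so a strict decode fails on most 8-bit
text, but a successful decode is only evidence, not proof.

```python
def classify(data: bytes) -> str:
    """Classify bytes as 'ascii', 'probably-utf8' or 'not-utf8'.

    Pure ASCII is identical in UTF-8 and in the ISO 8859 family, so no
    decision is needed for it. A strict UTF-8 decode that succeeds on
    non-ASCII data makes UTF-8 likely, but cannot rule out an 8-bit
    encoding that happens to form legal sequences.
    """
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "probably-utf8"


print(classify(b"plain text"))  # ascii
print(classify(b"Oc\xe9"))      # not-utf8
print(classify(b"Oc\xc3\xa9"))  # probably-utf8
```

The last case shows the limit for short files: the bytes c3 a9 are legal
UTF-8 for one accented letter, but they are also two legal ISO 8859-1
characters. If only the start of a file is read, the check also has to
tolerate a multi-byte sequence cut off at the end of the buffer.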
> It is easy to see that a file is not UTF-8, but it is difficult to
> guess, which 8-bit encoding it is. You might therefore want to add a
> configuration option that tells vim the most likely 8-bit encoding that
> it should assume when offering a conversion. A luxury function could
> even select a few samples with 8-bit characters from the encountered
> 8-bit file, and display them in UTF-8 mode using different conversion
> tables, such that the user can choose the most plausible one.
It's possible. This means a manual selection of encoding needs to be done.
That is in fact how it works now.
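The "luxury function" described in the quoted text could be sketched roughly
like this. It is a Python illustration only: the candidate list and the name
preview are assumptions of this example, and a real implementation would pick
the samples from the file and present them to the user.

```python
# Hypothetical list of 8-bit encodings to offer; the right set depends on
# the user's environment.
CANDIDATES = ["iso-8859-1", "cp437", "iso-8859-2", "koi8-r"]


def preview(sample: bytes, encodings=CANDIDATES) -> dict:
    """Decode the sample under each candidate encoding.

    Returns a mapping from encoding name to the decoded text, or None
    when the sample is not valid in that encoding or the codec is
    unknown. The user picks the most plausible rendering.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = sample.decode(name)
        except (UnicodeDecodeError, LookupError):
            results[name] = None
    return results


for name, text in preview(b"Oc\xe9").items():
    print(f"{name:12} {text!r}")
```

Every candidate here decodes the sample without error, which is exactly why
the choice has to be made by a person or by a configured default.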
Has anyone worked on a method to specify the file encoding with the file? So
that in a mixed environment an application can request the file encoding with
100% accuracy? This could be done by adding a signature to the file. For
file systems that support a resource fork (Mac, NTFS) it could be separate
from the actual file. That doesn't work on many systems though.
I can think of a solution that inserts a byte sequence at the start of a UTF8
file, which represents an empty, non-printing character. This would uniquely
identify a UTF8 file, without showing up when UTF8 encoding is used. Does
such a thing exist already? Still, how would this character be inserted when
using "cat >file"? The "cat" program would need to know about UTF8 then.
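One existing candidate for such a signature is the character U+FEFF, whose
UTF-8 encoding is the three bytes EF BB BF. Whether it is suitable for
general use on Unix is a separate question, and the "cat >file" problem
remains. The Python sketch below only illustrates the idea; the three helper
names are made up for this example, while codecs.BOM_UTF8 is the constant
Python provides for those three bytes.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_signature(data: bytes) -> bytes:
    """Prepend the signature unless it is already present."""
    return data if has_utf8_signature(data) else codecs.BOM_UTF8 + data


def strip_utf8_signature(data: bytes) -> bytes:
    """Remove a leading signature, if any."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data


signed = add_utf8_signature("Oc\u00e9".encode("utf-8"))
print(signed.hex())                # efbbbf4f63c3a9
print(has_utf8_signature(signed))  # True
```

A program that is not aware of the signature will pass the three bytes
through as data, so concatenating two signed files leaves a signature in the
middle of the result.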
> > I do understand that switching completely to UTF-8 makes a nice clean
> > system. I just don't see it happen for most users.
>
> Why not? It happened for the national variants of ISO 646 as well. Just
> let evolution take its path. I don't see a really big problem here. With
> Unicode spreading quickly via XML, Java, WinNT, etc., there will soon be
> a natural desire to have the same encoding richness also under Unix on
> all levels. UTF-8 is there to provide exactly this.
The spread of UTF8 will grow, of course. That it is the way to go is without
discussion. But that doesn't mean the number of otherwise encoded files will
shrink quickly. It takes a long time before an old standard has completely
died out. Simtel carries an awful lot of good old DOS programs; these don't
go away. They just get used less and less. When 99% of the files are UTF8
encoded, there is still that one percent left...
> Under Linux, it is at the moment necessary to recompile everything every
> 2-3 years, because of major incompatible changes. We had a.out->elf, we
> had libc -> glibc, and we still grow exponentially.
Why do you think I prefer FreeBSD? :-)
> We are not Microsoft. We are not slaves of a backwards compatibility dogma.
> We have the source and we use this power effectively. Hanging on to ancient
> binary-only installations is an evil idea that brought us Y2K, Windows98,
> and other catastrophes. Linux is well beyond that. (And I hope it remains
> that way.)
If Linux keeps on doing a big change every two years, users will be _very_
disappointed. So far Linux has mostly been used by people that develop
software themselves. They see the technological advantages of those changes.
Normal users will only see the disadvantages. "I just want to send out that
letter, and after the upgrade it prints out wrong characters!".
Let's try to have progress _and_ backwards compatibility. It's not
impossible. In my opinion it is very wrong to discard backwards compatibility
right away.
> Seriously, the development lifecycle of free software is so fast that
> adding UTF-8 really seems like a minor exercise, especially once some
> critical mass has been reached and it becomes widely fashionable.
There is a lot of development, but I wouldn't call this a minor exercise. It
seems you are ignoring the problems that are introduced. Underestimating this
is a big danger.
I would prefer to be prepared for those problems and deal with them. After
all, it's not the applications that will work that matter, it's the
applications that don't work that matter. If one out of ten applications
doesn't work properly with UTF8, and I need that application, I won't use
UTF8.
> In addition, we do not force anyone to migrate at a certain point to
> UTF-8. That is the idea behind using LC_CTYPE=UTF-8 as the big
> system-wide switch that allows you to move within minutes your entire
> system over to the new encoding. (Smooth as the start of a new
> century ... ;-)
That is NOT a solution. I don't want to use UTF8 for everything or for
nothing at all. I want to use several encodings at the same time, at least
until all the applications that I use handle UTF8 properly.
> For my daily use, I have only identified 14 programs that have to be
> changed. Four of these do already support UTF-8, 3 more will soon.
As I said, it's the programs that do not support it that matter. I don't
think Vim will have good UTF8 support within a year. And it will probably
take another year to fix problems.
> Great! But serious question: Are many of your users using more than one
> of the vim fileencoding options simultaneously?
That is not possible right now. But it is certainly an issue: People in Korea
have a mix of locally written files and files from other countries. You would
open two instances of Vim to deal with that now.
> Do you have people who switch forward and backward between japan and korea
> with vim? Wouldn't these people be the *first* ones who'd rather prefer to
> convert their files entirely to Unicode to avoid having to switch encodings
> all the time?
I don't know. If they own the files, they would probably convert them. But
when working in projects with others, or with distributed files, that is not
an option. Unless the conversion is done on-the-fly when opening the file.
That is probably the best solution. It does require automatic detection
though.
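On-the-fly conversion that leaves the file's own encoding untouched could be
sketched like this in Python. It is a simplified illustration: the names load
and save and the single fallback encoding are assumptions of this example,
and the detection is just "strict UTF-8, else the fallback", which is not
fully reliable.

```python
def load(path: str, fallback: str = "iso-8859-1"):
    """Read a file and return (text, encoding_used).

    Tries strict UTF-8 first and falls back to a configured 8-bit
    encoding, which is assumed rather than detected.
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


def save(path: str, text: str, encoding: str) -> None:
    """Write the text back in the encoding the file was loaded with."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))
```

Because save reuses the encoding reported by load, a file that is only viewed
or lightly edited keeps its original bytes for everything that was not
changed, so shared or distributed files are not silently converted.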
> These environments do not exclude each other. I am sure you realize that
> a single-encoding environment is just a convenient and simple special
> case of a more general mixed encoding environment.
A single-encoding environment is like paradise: It's what everybody wants, but
nobody has it.
A mixed-encoding environment is much more difficult to deal with. That is
where a lot of work still needs to be done, and preparations need to be made
to handle it properly. This will help the wide acceptance of UTF8.
My main conclusion for now: I would really like a way to reliably identify a
UTF8 encoded file. Is that already possible? If not, can it be added?
--
hundred-and-one symptoms of being an internet addict:
88. Every single time you press the 'Get mail' button...it does get new mail.
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/