Re: Testing for UTF-8 tty mode
Markus Kuhn wrote:
> It doesn't do any harm if the keyboard is in UTF-8 mode for these
> applications, on the contrary: "less" is entirely keyboard controlled by
> ASCII characters, and ASCII characters are encoded in UTF-8 also as
> ASCII characters. ISO 8859-* and UTF-8 are identical for the
> 0x0000-0x007f range, and this is the range that contains all "less"
> commands.
Hmm, I would expect switching the keyboard to UTF8 mode to take away some of
the "normal" keys, to allow entering "special" characters. How else would you
be able to enter more or different characters with the same keyboard? Or, in
other words, if the non-UTF8 mode is fully included in the UTF8 mode, why
would we ever want to use the non-UTF8 mode?
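As a side note, the byte-level claim in the quoted paragraph can be checked with a minimal Python sketch. Python is used only for illustration, and the helper name same_bytes is hypothetical. It suggests that the two modes differ only for keys that produce non-ASCII characters, where the keyboard has to send different bytes.

```python
def same_bytes(text: str) -> bool:
    """Return True if text encodes to identical bytes in ISO 8859-1 and UTF-8."""
    return text.encode("latin-1") == text.encode("utf-8")

ascii_commands = " b/q"   # typical "less" keystrokes, all in the ASCII range
accented = "\u00e9"       # e with acute accent, outside the ASCII range

ascii_result = same_bytes(ascii_commands)
accented_result = same_bytes(accented)

# The same character is one byte in ISO 8859-1 and two bytes in UTF-8.
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")

print(ascii_result, accented_result, latin1_bytes, utf8_bytes)
```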
> However as soon as you want to enter a regular expression into less, you
> really want to have the keyboard also in UTF-8 mode, because otherwise
> you couldn't enter non-ASCII characters. It makes absolutely no sense to
> have the screen in UTF-8 mode but the keyboard not if you use less,
> because string search would become unusable for non-ASCII characters
> otherwise.
>
> Therefore, whenever you have the screen in UTF-8 mode, you also want the
> keyboard to be in the same mode.
Not necessarily. Only if you want to type a search string that contains
non-ASCII characters. Mostly I only type <Space> and <b> in less...
Also, there are other ways to enter non-ASCII characters. For example, by
holding the ALT key and typing the key code on the numeric keypad. No need to
switch the keyboard to UTF8 mode. Not that this is a nice solution, but it
does avoid problems with switching the keyboard to another mode.
Another thing is the input method. Is that independent from the encoding?
I suppose so. Perhaps someone can tell if all input methods work with all
encodings. I suspect it's not so. Then switching to UTF8 might disable the
use of a certain input method. At least until it is made to support UTF8.
> A terminal is a bidirectional communications channel; features like
> character echoing tie both directions semantically very closely
> together, therefore they obviously should always have the same encoding.
Only for typed characters. Characters sent to the screen by an application
need not go the other way.
> > That doesn't solve the problem of having different types of files on my
> > harddisk.
>
> The big vision is *not* to have plaintext files in different
> encodings on the harddisk. The day you switch your system to UTF-8, you
> run a big
>
> find . -type f -exec recode latin1..utf8 {} \;
>
> over your entire harddisk, and from then on, everything is in UTF-8.
Aha. Well, this is worse than switching from a.out to elf. Don't count on me
switching to UTF8 for 100% within the next decade. That command looks like it
might mess up some data files anyway, I wouldn't dare to let it run on my
system.
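To make the worry concrete, here is a minimal sketch of a more cautious per-file check, written in Python purely for illustration. The function name to_utf8_if_latin1 and the NUL-byte heuristic are assumptions of this sketch, not something recode does. It leaves alone anything that looks binary or is already valid UTF-8, which also avoids double-encoding a file when the conversion is run twice.

```python
from typing import Optional

def to_utf8_if_latin1(data: bytes) -> Optional[bytes]:
    """Return UTF-8 bytes for a probable Latin-1 text file, or None to skip it."""
    if b"\x00" in data:
        return None  # heuristic (assumption): NUL bytes suggest a binary file
    try:
        data.decode("utf-8")
        return None  # already valid UTF-8, which includes pure ASCII
    except UnicodeDecodeError:
        pass
    # Assumption: whatever is left is Latin-1. Every byte sequence decodes
    # as Latin-1, so this cannot detect files in other 8-bit encodings.
    return data.decode("latin-1").encode("utf-8")

print(to_utf8_if_latin1(b"caf\xe9\n"))
```

Even with these checks, a data file that mixes text with raw 8-bit bytes but contains no NUL would still be rewritten, so this reduces the risk rather than removing it.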
> You set LC_CTYPE=UTF-8 to tell every application that everything is in UTF-8
> now. Applications that process files received from the outside world (e.g.,
> email readers and web software) will see LC_CTYPE=UTF-8 and will
> convert received MIME "text/* ; charset=xyz" files into UTF-8 before
> saving them on the harddisk. This way, you never again get non-UTF-8
> files onto your system. If you do (e.g., from a floppy disc), use iconv,
> recode, etc. to fix it manually, just like you have to fix it manually
> today if you read an MS-DOS CP437 file from a floppy.
I can imagine a lot of problems. What if I have one application that doesn't
support UTF8, but does use non-ASCII characters? (Don't answer! :-)
> You will spot non-UTF-8 files quickly, because they look funny in your
> UTF-8 terminal emulator.
Can you define funny??
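One possible reading of "funny", sketched in Python for illustration only: the decoding calls below stand in for what a terminal does with the bytes it receives, which is an assumption, since real terminal emulators may render invalid input differently.

```python
text = "na\u00efve caf\u00e9"   # contains i with diaeresis and e with acute

latin1_file = text.encode("latin-1")
utf8_file = text.encode("utf-8")

# Latin-1 bytes shown by a UTF-8 display: each invalid byte is replaced
# by U+FFFD, the replacement character.
shown_on_utf8 = latin1_file.decode("utf-8", errors="replace")

# UTF-8 bytes shown by a Latin-1 display: each non-ASCII character
# turns into two unrelated Latin-1 characters.
shown_on_latin1 = utf8_file.decode("latin-1")

print(shown_on_utf8)
print(shown_on_latin1)
```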
> You will therefore quickly convert them to
> UTF-8 as soon as you spot one and the problem will be fixed. Just as we
> eradicate MS-DOS CP437 files on our Linux partitions quickly by
> converting them to the system wide encoding. UTF-8 has the big advantage
> that no information is lost during these conversions, because Unicode is
> a superset of all other commonly used encodings.
As I said, problems...
I do understand that switching completely to UTF-8 makes a nice clean system.
I just don't see it happening for most users. It's better to fully prepare
for a mixed system. Let the developers handle the problems, so that users
need not worry about them.
> > This depends on the specific file, not on the environment. The
> > only solution for this is that the file itself specifies its encoding.
>
> No, the idea is that there exists a global system encoding and that all
> files are converted into this encoding. UTF-8 is ASCII compatible, so
> pure ASCII files will not have to be touched at all. We definitely do
> not want to carry the MIME character-set-tag mess over into the Linux
> file system! We want to have only one single encoding to *avoid* having
> to tag a character set. This keeps everything neat and very simple.
I am currently using binaries that were compiled more than five years ago.
This switch to UTF8 probably means I have to get rid of those. That is not an
attractive option... At least with the switch from a.out to elf I was able to
recompile the programs. For the switch to UTF8 the sources need to be
changed. That is much more complicated.
> > For an editor, you might want to switch dynamically between different modes,
> > depending on the type of file being edited at the time.
>
> In the end, you want to have your editor always in UTF-8 mode, because
> all your files will be in UTF-8. Just like the Plan9 folks have done
> it for half a decade already.
Well, perhaps in another decade I will. But for now, and for most people,
there will be a mix of encodings.
> No, LC_CTYPE specifies the format of the file itself all the time.
> LC_CTYPE specifies the system-wide character encoding for all files,
> filenames, etc.
>
> You see what horrible trouble you get into if you suddenly try to
> introduce the notion of file types into Unix. If you had multiple
> encodings simultaneously in one system, the result of a "cat
> iso8859-1.txt iso8859-5.txt utf-8.txt" would not be processable by any
> software. Do you want to have to add a recoding functionality to
> cat-like applications? Certainly not.
There are already many different encodings being used. In Vim there is the
'fileencoding' option. Possible values currently are:
    ansi     default setting, good for most Western languages
    japan    set to use shift-JIS (Windows CP 932) encoding
    korea    set to use Korean DBCS
    prc      use simplified Chinese encoding
    taiwan   use traditional Chinese encoding
I intend to add "utf8" to this, that's why I am in this group.
I'm aiming at supporting a mixed environment, that is why I was asking how
this would work. Apparently you are aiming at a single-encoding environment.
That won't be of help to me then. Is this group just for making a
single-encoding system? In that case I had better unsubscribe...
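For reference, the "cat" scenario quoted above can be reproduced with a small Python sketch. It is an illustration under assumptions: the sample strings and the helper name is_valid_utf8 are made up here. It shows that the concatenated bytes are not valid UTF-8 as a whole, while each part is fine under its own encoding, which is the situation a mixed environment has to cope with.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Three small "files", each in a different encoding (sample content).
part_latin1 = "caf\u00e9\n".encode("latin-1")
part_cyrillic = "\u0416\n".encode("iso8859_5")   # a Cyrillic capital letter
part_utf8 = "caf\u00e9\n".encode("utf-8")

# What "cat" would produce: the raw bytes, one file after another.
combined = part_latin1 + part_cyrillic + part_utf8

print(is_valid_utf8(part_utf8), is_valid_utf8(combined))
```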
> > And what if I have one file system with (say, from Windows NT) and one
> > without (say from OS/2) that type? This probably requires specifying the
> > type to mount itself. LC_CTYPE could be used as the default though.
>
> For Windows NT, we have the official definition from Microsoft that its
> filenames are always in Unicode, so no problem here. I don't know about
> OS/2 files, but I would assume that the user of an OS/2 system has (like
> under DOS) to agree on one system-wide encoding used in all OS/2
> filenames (typically CP850, I'd expect), and we then have to manually
> tell under Linux the OS/2 FS mount command in what encoding the file
> names on the OS/2 partition should be interpreted. LC_CTYPE would tell
> the mount command in what encoding the filenames should be
> presented to Linux user processes such as "ls". Most of the code is
> actually already there, the only thing missing I think is to add UTF-8
> as yet another encoding to the OS/2 driver, and to make mount read and
> interpret LC_CTYPE.
OK, so LC_CTYPE tells the OS/2 filesystem how to present characters
to applications. The question remains how to tell it what encoding is
used in the actual file system. Perhaps it's fixed, like with NTFS; then
it's simple.
By the way, note that Windows NT mounts FAT file systems, which are, of
course, not Unicode. Thus it must already do a translation. Only for the
file names though, not for the file contents!
--
hundred-and-one symptoms of being an internet addict:
73. You give your dog used motherboards instead of bones
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/