Kaixo!

On Mon, Jun 14, 2004 at 08:43:38AM -0700, Elvis Presley wrote:

> Unicode Keyboard Input Linux

In fact unicode (through utf-8, of course) mostly works on the console. The drawbacks are currently tied to the nature of the console (in the current text mode) and not to the encoding.

The main drawbacks are:

- display is limited to at most 512 different glyphs; that is enough for most alphabetic languages, but not for CJK languages, for example.
- display is limited to the 1 char = 1 glyph = 1 cell paradigm; that means languages like Thai, where a sequence of chars can have their glyphs stacked on top of one another in a single cell, will display horribly; languages needing glyph recomposition, like those using indic scripts, are simply impossible. Note that even some languages using the latin alphabet are hurt, as they use some accented letters that have no precomposed form in unicode and are encoded as a base letter plus a combining accent.

The difference with xterm-like terminals here is huge. On X11 powerful font functions are available, and there are text terminals that can nicely display scripts where 1 char is not necessarily equal to 1 glyph, nor necessarily equal to 1 cell; and you are not limited in the number of glyphs, so you can write in Chinese without problem. Plus, the resolution is much better, and the range of available fonts to choose from is much, much wider.

There are also input problems on the console. Typing unicode chars directly (with 1 keystroke = 1 char) is not a problem at all. It is just tedious to write the keymaps, and if you want to support both utf-8 and one or several old encodings, you have to provide a different keyboard file for each encoding. That is very bad; it would be much better to have a single keyboard description file, in unicode, and just tell loadkeys which character set is wanted (the default being whatever glibc says is the default for the current locale).

For composing, however, it's bad: kernel composing tables use "char", so it is not possible to properly use dead keys or the compose key while in unicode on the console. (If you only compose chars that are also in the iso-8859-1 character set, it more or less works; you just have to type an extra keystroke, which is lost in outer space. But I doubt it will work for other chars; I suspect it "mostly works" for some chars only because their iso-8859-1 8-bit code has the same numeric value as their unicode code point.)

For languages needing the help of an input method, the console is mostly unusable (it would be very nice to have a single input method backend usable on both the console and X11; but so far I know of none that does that and that is usable and widely used).

Input works (almost) perfectly on X11 (the problem is due to the input framework of XFree86, which doesn't allow switching input methods, so you cannot type some words in Korean and then switch to Chinese input... but some programs have started to bypass it, and xorg seems to use an input framework that solves that long-standing annoyance). And output works on X11.

So it could be a good thing to have the engine of a good xterm-like terminal used for the console, of course removing any unneeded linking to X11 libs; it would solve a lot of things. Of course it would only work on screens with graphical capabilities, not on a real vt100, French minitels or hp48 screens; but nobody expects to be able to write in Devanagari on such devices, I think.

> The real console is essentially a graphical device,

Not always. Not on some local screens on old PCs. It has always been a graphical device on local screens for all non-PC ports of linux; but for the PC itself, the text mode on the local screen as a graphical device is something quite new (you can look at when the "framebuffer" appeared on the i386 branch of linux to see the exact date).

Also, you can redirect the console to a device other than a local screen (again, that was there first on non-PC branches, I think the SUN ports were first; on a PC you can redirect the console to a serial port).

In fact, whether the console is physically a graphical device or not, for the operating system it is not; it is just text. That doesn't mean there couldn't be a graphical device, nor that such a device couldn't be used for the console, nor that such a graphical console couldn't do nice graphical things with text, as is done in modern xterms on X11. But that is not done through the normal I/O channels; programs see the console just as a text device, and send text streams with some control codes to place the cursor, change color, etc. There is no way to play with individual pixels through the console I/O API, for example.

> with screen(=display), keyboard and mouse, and whatever else might be considered interesting... Applications do not open the real console directly, but in theory, they could --in DOS they could: the interface could be made public; there would have to be a device special file for the real console, and the virtual consoles too, and the pseudo terminals... Have I forgotten anything?

It seems you are calling "console" what I would call "framebuffer". For me, the "console" is the system that allows the kernel to display text and get output locally: the /dev/console device.

> You could not really use keymaps in a traditional tty configuration anyway, because the ascii terminal can't display unicode characters,

Keymaps and ascii-only output are unrelated issues. You can write unaccented French or German in ascii only (it is ugly, it is bad, but it is possible) and yet want to use a French or German keyboard layout. Or you simply want to write English in ascii only, but you like the dvorak layout...

> Of course, the tty module still must understand unicode.
> I don't think this is a big problem, because the basic repertoire remains the same (=ascii) thanks to the utf-8 encoding, but I'm sure there are a few hidden traps.

A lot. ascii makes a lot of assumptions that are simply false in utf-8:

1. one char = 1 byte (that is false in utf-8 after U+007F)
2. one char = 1 cell (that is false, see combining diacritics etc.)
3. one char = 1 glyph (that is false again; the arabic char "noon" has 4 different-looking shapes depending on what comes before and after it)
4. text is written left to right (that is false for several scripts and languages; some scripts are even truly bidirectional; and there are even scripts that are written only vertically (unlike CJK, which can also be written horizontally) and that are currently completely unsupported, although encoded in unicode)
5. Del and Backspace are similar (false again; a Del removes the content of a cell, which can be several chars (and one char can be several bytes too), while Backspace only removes one char)
6. text selection in a bidirectional environment is hairy

etc.

Of course, even minimalistic utf-8 support is better than nothing; but saying "understand unicode" is misleading. There are various levels of understanding possible, and various levels of support possible; things aren't as simple as with ascii.

> Anything (module or program) which opens the master side of a pseudo-terminal is called a terminal emulator, therefore a 'vc' and an 'xterm' perform the same function, but in different spaces. I wonder how much of the software can be reused. You need vc's in the kernel in the absence of X, to support Linux virtual terminals.

No, you don't. It would be perfectly ok to provide only very minimalistic kernel support (even simpler and lighter than the current one) and have a user space 'vc' loaded early in the boot process. In fact it would be much saner.

[...]
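
To make the byte / char / cell distinction from the numbered list above concrete, here is a minimal Python sketch. The function names `cell_width` and `measure` are made up for this illustration, and the width rule is only a rough approximation (combining marks count as zero cells, East Asian wide and fullwidth chars as two); a real terminal needs much more than this, and the sketch ignores glyph shaping and bidirectionality entirely.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Rough estimate of how many terminal cells one code point occupies."""
    # Nonspacing and enclosing marks (e.g. U+0301) stack onto the previous char.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide / Fullwidth chars usually take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def measure(text: str) -> tuple:
    """Return (utf-8 bytes, code points, estimated cells) for text."""
    n_bytes = len(text.encode("utf-8"))
    n_chars = len(text)
    n_cells = sum(cell_width(ch) for ch in text)
    return n_bytes, n_chars, n_cells


if __name__ == "__main__":
    # plain ascii, "e" + combining acute accent, one CJK ideograph
    for sample in ("abc", "e\u0301", "\u4e2d"):
        print(repr(sample), measure(sample))
```

Only for the pure ascii sample do the three numbers agree; for the other two samples bytes, chars and cells all differ, which is exactly where an editor or line discipline assuming 1 byte = 1 char = 1 cell puts the cursor in the wrong place.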

> Comparing characters would be easy, they compare as unsigned integers, but sorting them would be a problem, because you'd want to group all the (accented) vowels together, according to language specific rules.

That is not new with unicode; it was already the case with other encodings, including ascii. And it has nothing to do with the console/terminal anyway.

> In Greek, this wouldn't be a problem, because monotonic vowels and polytonic vowels,

No, it's not a problem if the proper sorting rules are used (you choose them with the LC_COLLATE variable). I don't know how accurate the sorting of polytonic letters is with the currently used greek locales; but that is easy to fix anyway; the problem is not technical at all.

> The editor 'vi' would have to be modified to get/put wcar_t, so I don't understand why you'd need a separate unicode editor, or separate unicode application, whatever it might be.

The problem with text editors is the same as with command line editing: cursor positioning and character deletion. With unicode you can no longer assume that 1 char = 1 byte = 1 cell. Editors assuming 1 byte = 1 char are irremediably broken anyway, and completely unusable (the cursor is displayed at a completely different place from where it really is!); character selection and deletion are also complicated by the fact that 1 cell can be made of several characters. And there are bidirectionality problems as well. So editors that are deficient in some of those aspects need to be fixed, or replaced with other editors able to do the job. Plain "vi" is fine for handling raw ascii, but useless for editing real human text in utf-8. (vim, on the other hand, is decent.)

> 1) Does 'sort' work on utf-8 input?

yes.

> 2) Does 'grep' (Unix search) work on utf-8 input?

yes.

> 3) Is there a laundry list or Unix filters which need to be changed to support Internationalization? I know 'cat' doesn't.

I don't know.

> Why do Greek newspapers still use ISO 8859-7?

For the same reason that a majority of English language web sites still use windows-1252, I suppose.

> Since utf-8 doubles the size of a file,

It doesn't; it depends on the text. But anyway, even in the worst case, the increase in size for text is trivial compared to the huge size taken by images, sound, video, etc.

> it looks like these older character sets will be around for a long time.

Yes, but not for that reason; they are around because there is a lot of *OLD* data in those encodings, and it needs to be supported. But charset encoding is, for a majority of end users, a moot point. They simply don't care, nor do they even know what encoding is used; they just see text on screen, that's all. It is the program that does any charset conversion for them, if needed. Note that nowadays a majority of programs have already switched to using unicode internally.

> Unicode is a much nicer solution, except it's prejudiced against non-english speakers.

??? It's exactly the opposite! Of all existing charset encodings, unicode is the only one that is not prejudiced against any particular language.

> All tags are ascii, but the content can be anything, just switch keymaps, no need to tag the content again. However, double the size of the file and you double the download time too. Now you need a server twice as big.

I suggest you look at the size of your html files and of your image/sound/etc. files on your typical web server; even if the size of the html files doubled, the percent increase in the total would be small. And you don't double the size: there is a lot of html tagging in ascii that just doesn't change. In fact, in some cases the size may decrease, if you replace a bunch of ugly &html;&mar;&ent;&iti;&es; &ug;&ly; &as;&hell; with real and readable utf-8 characters.
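
The size argument above is easy to check. The following is a minimal Python sketch, not a measurement of any real site: the sample markup and the helper names `encoded_sizes` and `entity_vs_utf8` are invented for this illustration. It compares the same Greek html fragment encoded in iso-8859-7 and in utf-8, and an html entity against the real utf-8 character it stands for.

```python
import html

# Invented sample: ascii tags around a short Greek greeting.
SAMPLE = "<p>Καλημέρα κόσμε</p>"


def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in iso-8859-7 and in utf-8."""
    return {
        "iso-8859-7": len(text.encode("iso8859_7")),
        "utf-8": len(text.encode("utf-8")),
    }


def entity_vs_utf8(markup: str) -> tuple:
    """Byte length with html entities vs. with real utf-8 characters."""
    decoded = html.unescape(markup)  # "&eacute;" becomes "é"
    return len(markup.encode("ascii")), len(decoded.encode("utf-8"))


if __name__ == "__main__":
    print(encoded_sizes(SAMPLE))
    print(entity_vs_utf8("caf&eacute;"))
```

In the Greek sample only the 13 Greek letters grow from one byte to two; the tags and the space stay at one byte each, so even this tag-light fragment does not double. In the entity case the utf-8 version is the smaller one.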

Note also that the same "size increase" argument was made when 7-bit encodings (like ascii or koi7) were replaced with 8-bit ones (like cp1252, iso-8859-7 or koi8-r), yet the new 8-bit encodings were overwhelmingly adopted, simply because 7 bits only was too limiting.

> It looks to me like the most important distinction between locales is not language, but national currency symbol.

Not for countries using the euro currency :)

Differences are a combination of both language and national preferences. Some things are very largely on the side of language difference, like sorting order (LC_COLLATE), uppercase/lowercase conversion and the definition of what is a letter, etc. (LC_CTYPE), or the date format (LC_TIME); others are more influenced by political boundaries, like monetary conventions (LC_MONETARY), paper size, or telephone number notation.

> 1) What are utf-8 locales? I would have thought that utf-8 would be applicable across all locales.

No; each locale defines an encoding. en_US.ISO-8859-1 and en_US.UTF-8 are not the same.

> Hypothesis: There could be an iso 8859 locale and a unicode locale for the same "region" for historical reasons.

Yes; the simple fact that non utf-8 encodings existed (and still exist) makes it necessary to recognize them.

> This is causing the confusion. I've never worked in Latin-1, or Latin-2, just ascii and unicode,

I very much doubt you worked in "just ascii" (maybe in 1969; but clearly not in the 1980s, much less in the 1990s); most probably you were using one of cp437 (in DOS) or iso-8859-1 (in unix).

> and I don't even want to think about using a different copy of the same program for each.

?? There is no confusion, nor any need for different copies. The situation is actually quite simple: a program either is internationalized, or it is not. If it is not, it is broken, and doomed to die as people stop using it.
If it is internationalized, then the character encoding it uses for I/O will be transparent to the user; it will follow the locale and work smoothly.

> Now, what about left->right and double-column characters?

And don't forget the zero-column ones.

> Can I run a copy of X windows in an xterm?

An xterm is a text terminal only.

> Is there a version of X which runs as a Microsoft Window

A lot of them, actually. I once used XWin32 to launch X clients from a unix box and display them on the screen of a win95 machine that had a much better screen and graphics card.

> Is there a version of Linux which runs as a Microsoft Window (not cygwin)?

?? What you say doesn't make sense. (You can, on the other hand, run an operating system inside a virtual computer box inside another operating system.)

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.walon.org/pablo/ PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Catalan or Esperanto]
[min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]