Re: XTerm char-width handling
> > When characters can now be either 1 or 2 cells wide, what would be the
> > preferred semantics for the cursor control sequences? Should they
> > position the cursor on the n-th cell or on the n-th character in a line?
>
> In editors, only wchar-based cursor movement makes sense. Cells are not
> meaningful entities for an editor. In UTF even less so than in EUC.
One must be careful to distinguish between the observed behaviour and
the implementation of the terminal interface. Indeed, inside an
editor all characters are created equal: pressing cursor-right must
move to the next character, regardless of its width.
This has no direct implication on the xterm interface, though. In the
current situation with East-Asian xterms, translating between
characters and cells is the editor's responsibility (or, as Thomas
points out, preferably the curses library would take care of it). I
would like to argue that this is the right way to do it. Let me give
two arguments.
First, even inside an editor, character width IS visible to the user.
Imagine having these two lines in an editor (the AA, BB, etc pairs are
double-width characters):
123456789
AABBCCDDEEFF
123456789
If you start with the cursor on the fifth character in the first row
(the "5") and move the cursor to the next line, it should be on the
double-width "CC", NOT on the fifth double-width character
"EE". Moving further down, it should again be on the "5". (Try this
in Emacs!) When users think in terms of columns, they think about
actual screen positions, not about the number of characters.
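To make the first argument concrete, here is a minimal Python sketch of
how an editor could keep the screen column when moving between lines.
The helper names are hypothetical, and using the "W"/"F" classes from
unicodedata.east_asian_width is only an assumed stand-in for whatever
set of wide characters the terminal actually uses.

```python
import unicodedata

def char_width(ch):
    # Assumption: classes "W" and "F" approximate the terminal's wide set.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def column_of_char(line, index):
    # 0-based cell column at which the character line[index] starts.
    return sum(char_width(ch) for ch in line[:index])

def char_index_at_column(line, column):
    # Index of the character covering the given 0-based cell column,
    # or len(line) if the column lies past the end of the line.
    cell = 0
    for i, ch in enumerate(line):
        width = char_width(ch)
        if column < cell + width:
            return i
        cell += width
    return len(line)
```

With the cursor on the "5" (index 4, column 4), char_index_at_column on
a line of six wide characters yields index 2, the third wide character,
which is the "CC" of the example above.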
Second, let's take the problem a step further. What would you do if
you had to build a terminal that supports proportional fonts? Well,
you'd do exactly what X-terminals or Postscript printers do: the
protocol would be in terms of pixels, and the application would be
able to find out how wide each character is (from the server or font
metrics files).
What is needed is a way to move the current position to a given place
on the screen. On a character terminal, it is a useful simplification
to do this in terms of character cells, since they are all the same
size (and so all possible positions would be multiples of the cell
width). Looking at it this way, the correct way to implement
full-width and half-width characters is to do the addressing in terms
of the smallest possible unit---the half-width character.
As in the X-terminal/Postscript case, it is mandatory that the
application knows which characters are wide and which are not. We can
fix some set of wide characters and hope that everybody else is going
to stick to it, but there are some strings attached---future Unicode
updates may make it necessary to update the set of wide chars. (In the
current proposal for "iswide", at least 0x20000 .. 0x2ffff should be
added, which is already reserved for ideographs.) Alternatively, we
could use "font metric files" as is usual for software generating
Postscript output. Or we could do as for X-applications, and allow
the application to interrogate the terminal to find out which
characters are wide (it is already possible to interrogate the current
cursor position, so in principle this is already possible---but it is
probably too clumsy and slow to be useful in practice).
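The "fixed set" option could look roughly like the following Python
sketch: a sorted table of code point ranges with a binary search. The
ranges listed here are illustrative assumptions, not the table from the
actual "iswide" proposal; only the 0x20000 .. 0x2ffff entry comes from
the remark above. The names WIDE_RANGES, is_wide and string_cells are
hypothetical.

```python
from bisect import bisect_right

# Sorted, non-overlapping (first, last) code point ranges.
# Illustrative only; a real table would follow the agreed definition.
WIDE_RANGES = [
    (0x4E00, 0x9FFF),    # CJK unified ideographs
    (0xAC00, 0xD7A3),    # Hangul syllables
    (0xFF01, 0xFF60),    # fullwidth forms
    (0x20000, 0x2FFFF),  # plane reserved for ideographs
]
_STARTS = [first for first, _ in WIDE_RANGES]

def is_wide(code_point):
    # Find the last range starting at or before code_point, then
    # check whether code_point falls inside it.
    i = bisect_right(_STARTS, code_point) - 1
    return i >= 0 and code_point <= WIDE_RANGES[i][1]

def string_cells(text):
    # Number of character cells the text occupies on the screen.
    return sum(2 if is_wide(ord(ch)) else 1 for ch in text)
```

A future Unicode update would then mean changing the table in every
application that carries a copy, which is exactly the "strings
attached" mentioned above.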
By the way, we should not be calling this "East-Asian width", as we
are NOT trying to implement the behaviour specified in Unicode TR#11.
The East-Asian width specification specifies backwards compatibility
in East-Asian contexts: If I have a nicely formatted plain-text file
encoded in ShiftJIS, convert it to UTF-8, and view it with xterm, it
should come out as nicely formatted as it was. This is not our goal:
we do not want to stick slavishly to the wide paragraph sign and wide
Greek letters in these legacy encodings!
Here's my proposal for cursor movement/addressing, very much in line
with what current East-Asian xterms do, but removing the reliance on
"bytes":
We consider the screen a matrix of character cells. A
character cell can either hold a narrow character X, or half a
wide character (let's call such an item X_{left} or X_{right}).
All cursor movement (addressing, relative movement) is in terms
of character cells.
When writing a narrow char to the screen, it will fill the
current char cell. When writing a wide char X to the screen,
this is equivalent to writing the two narrow "chars" called
X_{left} and X_{right}. (There is no way to write half a char).
The display will combine pairs of X_left & X_right occurring in
adjacent cells and display them as the wide char X. All other
(not properly matching) occurrences of X_left or X_right are
displayed as U+303F. (This should never happen in compliant
apps.)
When the cursor is on a char cell displayed as a wide char, it
becomes twice as wide. (When it is on the second cell, we may
want to indicate this visually, although it should never happen
in compliant apps.)
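The proposal above can be sketched as a small Python model of one
screen row. This is only an illustration under stated assumptions: the
Row class, demo_is_wide and the tuple representation are hypothetical,
and a wide character written into the last cell of the row is simply
clipped, which the proposal does not specify.

```python
REPLACEMENT = "\u303f"  # shown for an unmatched half of a wide char

def demo_is_wide(ch):
    # Hypothetical predicate: treat CJK unified ideographs as wide.
    return 0x4E00 <= ord(ch) <= 0x9FFF

class Row:
    def __init__(self, width, is_wide=demo_is_wide):
        # Each cell holds (char, part); part is "N" for a narrow char,
        # "L" or "R" for the left or right half of a wide char.
        self.cells = [(" ", "N")] * width
        self.cursor = 0
        self.is_wide = is_wide

    def move_to(self, column):
        # Addressing is in character cells, not characters.
        self.cursor = column

    def write(self, ch):
        # A wide char is written as its two halves in adjacent cells.
        parts = ("L", "R") if self.is_wide(ch) else ("N",)
        for part in parts:
            if self.cursor < len(self.cells):
                self.cells[self.cursor] = (ch, part)
                self.cursor += 1

    def render(self):
        out = []
        i = 0
        while i < len(self.cells):
            ch, part = self.cells[i]
            nxt = self.cells[i + 1] if i + 1 < len(self.cells) else None
            if part == "N":
                out.append(ch)
                i += 1
            elif part == "L" and nxt == (ch, "R"):
                out.append(ch)  # matching pair shown as one wide char
                i += 2
            else:
                out.append(REPLACEMENT)  # orphaned half
                i += 1
        return "".join(out)
```

Overwriting the right half of a wide character with a narrow one
leaves an orphaned left half, which render() shows as U+303F.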
This will require that even very simple line editors output TWO
backspaces to rub out the last wide character. This is debatable,
but seems to be the clean thing to do. After all, the line editor
already needs to be able to parse UTF-8 to know how many bytes to
remove from its buffer.
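As a rough Python sketch of what such a line editor would do: strip
the last UTF-8 sequence from its buffer and emit one backspace per
cell. The function name rub_out_last and the is_wide callback are
hypothetical, the buffer is assumed to hold valid UTF-8, and erasing
by overwriting with spaces is my assumption about how the editor
clears the cells.

```python
def rub_out_last(buffer, is_wide):
    # buffer: bytearray of valid UTF-8 (assumption); modified in place.
    # Returns the bytes to send to the terminal.
    if not buffer:
        return b""
    end = len(buffer)
    start = end - 1
    # Step back over continuation bytes (10xxxxxx) to the lead byte.
    while start > 0 and (buffer[start] & 0xC0) == 0x80:
        start -= 1
    ch = bytes(buffer[start:end]).decode("utf-8")
    del buffer[start:end]
    cells = 2 if is_wide(ch) else 1
    # Move back over the cells, blank them, and move back again.
    return b"\b" * cells + b" " * cells + b"\b" * cells
```

For a wide character this returns two backspaces, two spaces and two
more backspaces; for a narrow one, the familiar backspace, space,
backspace.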
Otfried
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/