Re: GNU Emacs Unicode support



 > > I was referring to the range U+1100 to U+11f9, the conjoining Jamo.
 > > ... It doesn't make sense that the representation for choseong is
 > > full-width, while the jungseong and jongseong are half-width. (Even though
 > > EastAsianWidth-3.txt seems to indicate this.)
 > 
 > I see. Thanks for explaining. You don't agree with EastAsianWidth-3.txt.
 > So please ask on the unicode.org mailing list.

(I've pulled the discussion back on the list---I think wcwidth() needs
to be fixed, if we can agree on this.)

No, it's more subtle than that.  The definition in
EastAsianWidth-3.txt is confusing at first sight, but rational.  The
wcwidth() implementation is just plain wrong.

The conjoining Jamo are used to write Korean syllables.  The Jamo
elements themselves form a real alphabet, but due to the nature of
Korean writing, the renderer must combine each syllable into one
glyph.  In general, we can say that each syllable will have either CV
or CVF shape (C = leading consonant or choseong, V = vowel or
jungseong, F = final consonant or jongseong).  In particular, each
syllable will have exactly one C.  Under this assumption, if you
assign column-width two to C and zero to V and F, the total width of
any Jamo sequence will be computed correctly.  This is like pretending
that the C are ordinary base characters, with V and F being combining
characters that are rendered "on top" of C.

This appears to be the rationale behind the data in
EastAsianWidth-3.txt.  Note that the file does not claim that V and F
are narrow; they are indicated as "neutral".  As I read it, the
intent is to treat C as "wide" and V and F as "combining", but *only*
for the purpose of column width computation.
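
The width arithmetic under this reading can be sketched in a few
lines of C.  The names jamo_class and jamo_seq_width are made up for
illustration, and the caller is assumed to have classified each
element already:

```c
#include <stddef.h>

/* Hypothetical classification of a conjoining Jamo element. */
enum jamo_class { JAMO_C, JAMO_V, JAMO_F };

/* Width rule from the text: C counts as two columns, V and F as zero. */
static int jamo_class_width(enum jamo_class c)
{
    return c == JAMO_C ? 2 : 0;
}

/* Sum the per-element widths of a sequence of n classified Jamo. */
static int jamo_seq_width(const enum jamo_class *seq, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += jamo_class_width(seq[i]);
    return total;
}
```

For a CV syllable followed by a CVF syllable, that is the sequence
C V C V F, the sum is 2 + 0 + 2 + 0 + 0 = 4 columns, which matches
two rendered syllable glyphs of two columns each.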

Jamo V and F are not marked as "combining" in UnicodeData, with good
reason.  That's simply not how Jamo composition works.  (Note, by the
way, that the composition algorithm is a bit more complicated, and
does not require syllables to have CV or CVF shape.)

Markus' implementation of wcwidth() is based on the following
definition: chars marked as "combining" in UnicodeData get width 0,
those marked as "EastAsian-wide" get width 2, everything else gets 1.
The Jamo just don't fit this pattern.
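
A C restatement of that definition makes the mismatch easy to check.
This is only a sketch: the function name and the two boolean inputs
are hypothetical stand-ins for lookups in UnicodeData and
EastAsianWidth, and it is not Markus' actual code.

```c
#include <stdbool.h>

/* Hypothetical restatement of the rule described above:
   combining -> 0, East Asian wide -> 2, everything else -> 1.
   The two flags stand in for table lookups that are not shown. */
static int width_from_properties(bool is_combining, bool is_ea_wide)
{
    if (is_combining)
        return 0;
    if (is_ea_wide)
        return 2;
    return 1;
}
```

A Jamo C is wide and not combining, so it gets 2.  A Jamo V or F is
neither combining nor wide (it is "neutral"), so it gets 1, and a CVF
syllable adds up to 2 + 1 + 1 = 4 columns instead of 2.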

What to do for wcwidth?  As long as we want to give an answer per
character (instead of parsing a whole string and separating it into
Hangul syllables), I think the only correct answer is to make

U+1100..U+115f return 2,
U+1160..U+11ff return 0.

This would give the correct result on well-formed text in a Jamo-aware
renderer.  A non-Jamo-aware renderer would presumably render each
element separately, giving each character for which wcwidth() returns
zero the width of its base character.  Each syllable would then appear
as a sequence of (in general) two or three full-width Jamo elements,
which would be fine.
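
The per-character rule proposed above could be sketched in C roughly
as follows.  The names jamo_width and JAMO_NOT_HANDLED are invented
for this example; the idea is that a wcwidth() implementation would
check the Jamo ranges first and fall through to its usual tables for
everything else.

```c
#include <stdint.h>

/* Hypothetical sentinel: the character is not a conjoining Jamo, so
   the caller should consult its usual width tables instead. */
#define JAMO_NOT_HANDLED (-2)

/* Column width for the conjoining Jamo ranges proposed above. */
static int jamo_width(uint32_t ucs)
{
    if (ucs >= 0x1100 && ucs <= 0x115f)
        return 2;                /* U+1100..U+115f */
    if (ucs >= 0x1160 && ucs <= 0x11ff)
        return 0;                /* U+1160..U+11ff */
    return JAMO_NOT_HANDLED;
}
```

With this rule the sequence U+1112 U+1161 U+11AB (one CVF syllable)
sums to 2 + 0 + 0 = 2 columns.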

Otfried


-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/