
Re: wcwidth() implementation



> > This is mainly interesting for roundtrip conversion or for special styles.
> > Most CJK users will be happy to treat all alphabet-based characters as
> > single-width.
> 
> I don't know where you have this information from.  From all the talks
> I had recently it is clear that certainly is not the general opinion.

It is my opinion, based on regularly producing multilingual documents
with CJK versions and on intensive use of Chinese and Japanese in particular.

I can think of roughly the following cases where double-width Latin-based
characters are needed:

(1) punctuation characters like brackets, commas, periods.
(2) upper-case characters in abbreviations, according to styles preferred
    by some people, intended mainly to make sure that all CJK characters
    are in aligned blocks of exactly the same width, as used in draft
    paper (20 x 20 per page).  Even here, the alignment effect can be
    achieved without using double-width alphabetic characters, and it is
    a question of style.  Within the same text, people will prefer
    single-width for non-abbreviations, long words, quotations
    and the like. Within quotations, even single-width punctuation
    will be used.

So it seems the problem of single vs double width for these characters
can't be solved by a locale setting: both widths occur within the same
text.

One simple way to solve it in Unicode is to use explicit double-width
characters when they are needed.  Another is to use a display system that
offers some simple style markup, like

<dw>.... (GDP) .... (... <sw>gross national product</sw> ...) ...
</dw>
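
To make the two approaches concrete, here is a minimal sketch in Python.
It is not a proposal for the actual markup standard.  It assumes the
hypothetical <dw>/<sw> tags from the example above, maps printable ASCII
inside <dw> to the Unicode fullwidth forms (U+FF01..U+FF5E, which sit at a
fixed offset from U+0021..U+007E), and counts columns from the East Asian
Width property.  The names to_fullwidth, render and columns are made up
for illustration, and combining characters are ignored.

```python
import re
import unicodedata

# Fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at this offset.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text):
    """Replace printable ASCII with explicit double-width code points."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            out.append(chr(cp + FULLWIDTH_OFFSET))
        elif ch == " ":
            out.append("\u3000")  # ideographic (double-width) space
        else:
            out.append(ch)
    return "".join(out)


def render(markup):
    """Resolve the hypothetical <dw>/<sw> tags into plain Unicode text.

    Text outside any tag is treated as single-width (an assumption).
    """
    stack = ["sw"]  # bottom entry is the default style
    out = []
    for token in re.split(r"(</?(?:dw|sw)>)", markup):
        if not token:
            continue
        tag = re.fullmatch(r"<(/?)(dw|sw)>", token)
        if tag is None:
            # Ordinary text: widen it only if the innermost style is dw.
            out.append(to_fullwidth(token) if stack[-1] == "dw" else token)
        elif tag.group(1):
            # Closing tag must match the innermost open tag.
            if len(stack) == 1 or stack[-1] != tag.group(2):
                raise ValueError("unexpected " + token)
            stack.pop()
        else:
            stack.append(tag.group(2))
    if len(stack) != 1:
        raise ValueError("unclosed <" + stack[-1] + ">")
    return "".join(out)


def columns(text):
    """Count display columns: 2 for Fullwidth/Wide characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in text
    )


if __name__ == "__main__":
    plain = render("<dw>(GDP) <sw>gross national product</sw></dw>")
    print(plain, columns(plain))
```

Note that the width decision here is carried entirely by the markup and
by the code points it produces, not by any locale setting.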

This markup could include information about the text direction (e.g.
horizontal vs vertical), which also leads to the use of different shapes
for functionally identical punctuation characters.

Another item that belongs to this text style markup level is the handling
of ruby characters (small annotations set above or beside the base text
in Japanese).

This text style markup level is above the coding/locale level of choice
and below the hypertext level.  There should be an independent markup
standard for this, designed for use under

- any coding system (Unicode, ASCII, JIS, ...)
- any hypertext/display system (LaTeX, HTML, RTF, VT100)

Trying to handle this on a coding/locale level reminds me of some
discussions about using a BOM for file type recognition.

But where *do* we get our Basic Text Style Markup Language from, if we
aren't allowed to abuse some other level of choice?
		
--
phm

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/