Re: UTF-8 out-of-the box experience
Kazixo!
On Thu, May 03, 2001 at 04:16:57PM +0100, Markus Kuhn wrote:
>> The problem is primarly that the source of the man pages are not in utf-8.
>
> No, this is not at all the problem! Groff -Tutf8 on a non-utf8 file
> produces already perfectly nice UTF-8 files from ASCII man pages.
For ASCII, yes :) since ASCII is encoded identically in UTF-8.
I was thinking of French, Russian, etc. pages.
> I want to see first of all ASCII English man pages to look correctly in
> UTF-8 mode. These break already. It is far too early to worry about
> non-English man pages at this time.
I don't agree; those are the ones that most need that feature.
English-only pages are perfectly readable on a misconfigured system,
since the only things not displayed properly are the bullets and hyphens.
Not a big deal.
>> That is, the man page viewers have to be modified in order to be able
>> to convert encodings.
>
> At the moment, you get garbage with English ASCII. EUC-JP support for
> groff is a completely different topic, because PostScript doesn't
> support Japanese without additional fonts, etc. Let's first of all get
> the normal standard PostScript repertoire supported from English ASCII
> groff input.
I didn't mean PostScript (-Tps), only that plain tty output should work
for things other than ASCII.
groff can already read ISO-8859-1 and convert it to UTF-8 while formatting.
Why shouldn't it also be possible to handle UTF-8 input (no conversion
needed), any charset->UTF-8 conversion, and maybe even raw 8-bit
formatting, like -Tlatin1?
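
A minimal sketch of why the ISO-8859-1 case is the easy one: every Latin-1
byte is also its own Unicode code point, so the conversion needs no table.
The function name latin1_to_utf8 is hypothetical and is not groff code; other
charsets such as koi8-r would need a real mapping table or a conversion
library instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert n bytes of ISO-8859-1 text to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 if
 * outcap is too small. Hypothetical helper, not part of groff.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, size_t outcap)
{
    size_t i, o = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* ASCII is invariant: one byte in, the same byte out. */
            if (o + 1 > outcap)
                return (size_t)-1;
            out[o++] = b;
        } else {
            /* U+0080..U+00FF: two bytes, 110000xx 10xxxxxx. */
            if (o + 2 > outcap)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

The output can be up to twice the size of the input, so a caller that
allocates 2*n bytes never hits the error return.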
>> Have you tested that idea with Russian or Japanese man pages?
>> Eg: man pages in koi8-r displayed under an UTF-8 locale.
>
> The standard postscript fonts do not support Russian or Japanese, so
> what is the point??
I suppose someone using PostScript in a Russian or Japanese environment
has the correct PostScript fonts, but I was thinking of plain text,
not PostScript.
> Please remember that groff is primarily a tool to
> produce formatted PostScript output.
Well, I wonder. It seems that nowadays it is used (at least for non-English
text) primarily for formatting man pages for online viewing.
> > perl -e 'use utf8; print "\x{20ac}\b\x{20ac}\x{2203}\b___\n"' | less
> >
> > works.
>
> Which is definitely not how it should work. Less has to understand that
> \b moves back on a terminal one character, not one byte.
The problem is not with \b; that part worked with the patch I had.
The problem is that the 'underline' property was applied only to the
first *byte* of the character to be underlined.
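
A minimal sketch of the two steps involved, assuming a byte buffer with a
parallel attribute array: step back to the lead byte of the previous UTF-8
character, then tag every byte of that character. The names is_cont,
char_start, underline_prev_char and the ATTR_* constants are hypothetical
and are not the identifiers used in less.

```c
#include <assert.h>
#include <stddef.h>

#define ATTR_NORMAL    0
#define ATTR_UNDERLINE 1

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Return the index of the lead byte of the character that ends
 * just before pos. This is what a backspace has to do: move back
 * one character, not one byte.
 */
size_t char_start(const unsigned char *buf, size_t pos)
{
    assert(pos > 0);
    pos--;
    while (pos > 0 && is_cont(buf[pos]))
        pos--;
    return pos;
}

/*
 * Mark every byte of the character ending just before pos as
 * underlined, not only its first byte. Returns the index of the
 * character's lead byte.
 */
size_t underline_prev_char(const unsigned char *buf,
                           unsigned char *attr, size_t pos)
{
    size_t start = char_start(buf, pos);
    size_t i;

    for (i = start; i < pos; i++)
        attr[i] = ATTR_UNDERLINE;
    return start;
}
```

With the buffer 'a' followed by the three bytes of the euro sign
(0xE2 0x82 0xAC), calling underline_prev_char at position 4 marks bytes
1 to 3 and leaves the 'a' untouched.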
>> the most annoying thing is the broken
>> fontset support at XFree86 level, if that were fixed then automatically
>> all programs using fontsets (eg: all of Gnome, Windowmaker, etc) will start
>> displaying nice unicode out of the box.
>
> Has this been identified and fixed in the XFree86 4.1 snapshot?
I don't know, I haven't tried it yet.
>> There is also another bug, in 'ls'.
>> Try a 'touch somefile' with a utf8 name, then doing an ls; you will see
>> only '?'.
>
> I can't reproduce that problem under RH 7.1. "ls" seems to work
> correctly. I produced a number of files with (normal-width) non-ASCII
> characters, and their display and column arrangement looks very nice.
> Are you sure you have successfully selected an existing UTF-8 locale
> (setlocale() didn't return an error?).
It is apparently fixed in fileutils 4.1. I have fileutils 4.0, and yes,
it is a bug in 4.0: it is reproducible with all multibyte encodings
(while single-byte encodings work just fine).
> When I just tried to enter
>
> touch äää
>
> to test ls, it also became apparent that in RH 7.1 bash (and readline?)
> break in a UTF-8 locale.
Indeed.
Note that it works with EUC-JP (and probably with other CJK locales; I
haven't tested them), which is strange.
>> Is it ok to post patches in this ml ?
ok.
here it is.
> Sure. (And probably also to bug-less@xxxxxxxx)
already sent.
> Does your patch also do biwidth output correctly?
I don't know (do you have a text sample I can use?).
The patch is from Alair McKinstry, but its underscore handling was wrong;
I only added the 'underline' bits and the 'while (overstrike)' loop.
--
Ki ça vos våye bén,
Pablo Saratxaga
http://www.srtxg.easynet.be/ PGP Key available, key ID: 0x8F0E4975
--- less-358/line.c.utf8 Sun Jul 9 02:26:46 2000
+++ less-358/line.c Thu May 3 19:20:16 2001
@@ -32,6 +32,7 @@
static int column; /* Printable length, accounting for
backspaces, etc. */
static int overstrike; /* Next char should overstrike previous char */
+static int underline=0;
static int is_null_line; /* There is no current line */
static int lmargin; /* Left margin */
static char pendc;
@@ -286,8 +287,10 @@
static void
backc()
{
- curr--;
- column -= pwidth(linebuf[curr], attr[curr]);
+ do {
+ curr--;
+ column -= pwidth(linebuf[curr], attr[curr]);
+ } while (utf_mode && IS_CONT(linebuf[curr]));
}
/*
@@ -312,6 +315,17 @@
return (0);
}
+int utf8_seq_length(unsigned char startbyte) {
+ if (startbyte < 0x80) return 1;
+ if (startbyte < 0xc0) return 0; /* what to do about invalid input? */
+ if (startbyte < 0xe0) return 2;
+ if (startbyte < 0xf0) return 3;
+ if (startbyte < 0xf8) return 4;
+ if (startbyte < 0xfc) return 5;
+ if (startbyte < 0xfe) return 6;
+ return 0; /* what to do about invalid input? */
+}
+
/*
* Append a character and attribute to the line buffer.
*/
@@ -457,9 +471,15 @@
if (curr == 0)
break;
backc();
- overstrike = 1;
+ if (utf_mode)
+ overstrike = utf8_seq_length(linebuf[curr]);
+ else
+ overstrike = 1;
break;
}
+ } else if (underline>1) {
+ STOREC(c, AT_UNDERLINE);
+ underline--;
} else if (overstrike)
{
/*
@@ -469,14 +489,22 @@
* bold (if an identical character is overstruck),
* or just deletion of the character in the buffer.
*/
- overstrike = 0;
+ overstrike--;
if ((char)c == linebuf[curr])
STOREC(linebuf[curr], AT_BOLD);
- else if (c == '_')
+ else if (c == '_') {
STOREC(linebuf[curr], AT_UNDERLINE);
- else if (linebuf[curr] == '_')
+ while (overstrike) {
+ STOREC(linebuf[curr], AT_UNDERLINE);
+ overstrike--;
+ }
+ } else if (linebuf[curr] == '_') {
STOREC(c, AT_UNDERLINE);
- else if (control_char(c))
+ if (utf_mode) {
+ underline = utf8_seq_length(c);
+ overstrike = 0;
+ }
+ } else if (control_char(c))
goto do_control_char;
else
STOREC(c, AT_NORMAL);