Re: UTF-8 out-of-the box experience



Kazixo!

On Thu, May 03, 2001 at 04:16:57PM +0100, Markus Kuhn wrote:
 
>> The problem is primarily that the sources of the man pages are not in utf-8.
> 
> No, this is not at all the problem! Groff -Tutf8 on a non-utf8 file
> produces already perfectly nice UTF-8 files from ASCII man pages.

For ASCII, yes :) since ASCII happens to be invariant under this conversion.
I was thinking of French, Russian, etc. pages.

> I want to see first of all ASCII English man pages to look correctly in
> UTF-8 mode. These break already. It is far too early to worry about
> non-English man pages at this time.

I don't agree; those are the pages that need this feature most.
English-only pages are perfectly readable on a misconfigured system:
the only things that will not display properly are the bullets and
hyphens, which is not a big deal.

>> That is, the man page viewers have to be modified in order to be able
>> to convert encodings.
> 
> At the moment, you get garbage with English ASCII. EUC-JP support for
> groff is a completely different topic, because PostScript doesn't
> support Japanese without additional fonts, etc. Let's first of all get
> the normal standard PostScript repertoire supported from English ASCII
> groff input.

I didn't mean PostScript (-Tps), but simply that plain tty output should
work for something other than ASCII.
groff is able to handle iso-8859-1 and convert it to utf-8 while formatting.
Why would it not be possible to also handle utf-8 input (no conversion
needed), as well as any charset->utf-8 conversion (and maybe even raw 8-bit
formatting, like -Tlatin1)?
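
The iso-8859-1 case is the easy one, because each Latin-1 byte value is
also its Unicode code point, so every byte becomes either one or two
UTF-8 bytes. A minimal sketch of that conversion, assuming the caller
provides an output buffer of at least 2*n bytes (latin1_to_utf8 is a name
made up for this example; it is not groff code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert n bytes of ISO-8859-1 text in in[] to UTF-8 in out[].
 * The caller must provide room for at least 2*n bytes in out[].
 * Returns the number of bytes written.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;

	assert(in != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		unsigned char c = in[i];

		if (c < 0x80) {
			/* ASCII is invariant: copy the byte unchanged. */
			out[o++] = c;
		} else {
			/* U+0080..U+00FF: 110000xx 10xxxxxx */
			out[o++] = (unsigned char)(0xC0 | (c >> 6));
			out[o++] = (unsigned char)(0x80 | (c & 0x3F));
		}
	}
	return o;
}
```

Other charsets (koi8-r, euc-jp, ...) would need a real conversion table
or library; only Latin-1 maps this directly.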

>> Have you tested that idea with Russian or Japanese man pages?
>> Eg: man pages in koi8-r displayed under an UTF-8 locale.
> 
> The standard postscript fonts do not support Russian or Japanese, so
> what is the point??

I suppose someone using PostScript in a Russian or Japanese environment
has the correct PostScript fonts; but I was thinking of plain text,
not PostScript.

> Please remember that groff is primarily a tool to
> produce formatted PostScript output.

Well, I wonder. It seems that nowadays it is used (at least for non-English
text) primarily for online formatting of man pages.

> > perl -e 'use utf8; print "\x{20ac}\b\x{20ac}\x{2203}\b___\n"' | less
> > 
> > works.
> 
> Which is definitely not how it should work. Less has to understand that
> \b moves back on a terminal one character, not one byte.

The problem is not with \b, that bit worked with the patch I had.
The problem is the 'underline' property only applied to the first *byte*
of the char to be underlined.
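
To illustrate the point, here is a minimal sketch (not the less code
itself; utf8_prev_start and utf8_mark_char are names made up for this
example) of stepping back over a whole character and of applying an
attribute to every byte of it, assuming the buffer holds valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_cont(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Return the index of the first byte of the character that ends
 * just before position pos. This is what a backspace should do:
 * move back one character, not one byte.
 */
size_t utf8_prev_start(const unsigned char *buf, size_t pos)
{
	assert(buf != NULL && pos > 0);
	do {
		pos--;
	} while (pos > 0 && is_cont(buf[pos]));
	return pos;
}

/*
 * Set attribute a on every byte of the character starting at
 * index start, not only on its first byte. Returns the index
 * just past the character.
 */
size_t utf8_mark_char(const unsigned char *buf, size_t len,
		      unsigned char *attr, size_t start, unsigned char a)
{
	size_t i = start;

	assert(buf != NULL && attr != NULL && start < len);
	attr[i++] = a;
	while (i < len && is_cont(buf[i]))
		attr[i++] = a;
	return i;
}
```

With the buffer "a" followed by the three bytes of U+20AC, stepping back
from the end lands on index 1, and marking from there sets the attribute
on bytes 1 to 3.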
 
>> the most annoying thing is the broken
>> fontset support at XFree86 level, if that were fixed then automatically
>> all programs using fontsets (eg: all of Gnome, Windowmaker, etc) will start
>> displaying nice unicode out of the box.
> 
> Has this been identified and fixed in the XFree86 4.1 snapshot?

I don't know, I haven't tried it yet.

>> There is also another bug, in 'ls'.
>> Try a 'touch somefile' with a utf8 name, then doing an ls; you will see
>> only '?'.
> 
> I can't reproduce that problem under RH 7.1. "ls" seems to work
> correctly. I produced a number of files with (normal-width) non-ASCII
> characters, and their display and column arrangement looks very nice.
> Are you sure you have successfully selected an existing UTF-8 locale
> (setlocale() didn't return an error?).

It apparently is fixed in fileutils 4.1; I have fileutils 4.0, and yes,
it is a bug in 4.0. It is reproducible with all multibyte encodings
(while single-byte encodings work just fine).

> When I just tried to enter 
> 
>   touch äää
> 
> to test ls, it also became apparent that in RH 7.1 bash (and readline?)
> break in a UTF-8 locale.

Indeed.
Note that it works with euc-jp (and probably with other CJK locales; I
haven't tested them), which is strange.

>> Is it ok to post patches in this ml ? 

ok.
here it is.


> Sure. (And probably also to bug-less@xxxxxxxx)

already sent.

> Does your patch also do biwidth output correctly?

I don't know (do you have a text sample I can use?).
The patch was from Alair McKinstry, but the underscore handling
was wrong; I only added the 'underline' bits and the 'while (overstrike)'
loop.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975
--- less-358/line.c.utf8	Sun Jul  9 02:26:46 2000
+++ less-358/line.c	Thu May  3 19:20:16 2001
@@ -32,6 +32,7 @@
 static int column;		/* Printable length, accounting for
 				   backspaces, etc. */
 static int overstrike;		/* Next char should overstrike previous char */
+static int underline=0;
 static int is_null_line;	/* There is no current line */
 static int lmargin;		/* Left margin */
 static char pendc;
@@ -286,8 +287,10 @@
 	static void
 backc()
 {
-	curr--;
-	column -= pwidth(linebuf[curr], attr[curr]);
+        do {
+		curr--;
+		column -= pwidth(linebuf[curr], attr[curr]);
+	} while (utf_mode && IS_CONT(linebuf[curr]));
 }
 
 /*
@@ -312,6 +315,17 @@
 	return (0);
 }
 
+int utf8_seq_length(unsigned char startbyte) {
+         if (startbyte < 0x80) return 1;
+         if (startbyte < 0xc0) return 0; /* what to do about invalid input? */
+         if (startbyte < 0xe0) return 2;
+         if (startbyte < 0xf0) return 3;
+         if (startbyte < 0xf8) return 4;
+         if (startbyte < 0xfc) return 5;
+         if (startbyte < 0xfe) return 6;
+         return 0; /* what to do about invalid input? */
+}
+
 /*
  * Append a character and attribute to the line buffer.
  */
@@ -457,9 +471,15 @@
 			if (curr == 0)
 				break;
 			backc();
-			overstrike = 1;
+			if (utf_mode)
+				overstrike = utf8_seq_length(linebuf[curr]);
+			else
+				overstrike = 1;
 			break;
 		}
+	} else if (underline>1) {
+		STOREC(c, AT_UNDERLINE);
+		underline--;
 	} else if (overstrike)
 	{
 		/*
@@ -469,14 +489,22 @@
 		 * bold (if an identical character is overstruck),
 		 * or just deletion of the character in the buffer.
 		 */
-		overstrike = 0;
+		overstrike--;
 		if ((char)c == linebuf[curr])
 			STOREC(linebuf[curr], AT_BOLD);
-		else if (c == '_')
+		else if (c == '_') {
 			STOREC(linebuf[curr], AT_UNDERLINE);
-		else if (linebuf[curr] == '_')
+			while (overstrike) {
+				STOREC(linebuf[curr], AT_UNDERLINE);
+				overstrike--;
+			}
+		} else if (linebuf[curr] == '_') {
 			STOREC(c, AT_UNDERLINE);
-		else if (control_char(c))
+			if (utf_mode) {
+				underline = utf8_seq_length(c);
+				overstrike = 0;
+			}
+		} else if (control_char(c))
 			goto do_control_char;
 		else
 			STOREC(c, AT_NORMAL);