
Re: UTF-8 out-of-the box experience



Kaixo!

On Thu, May 03, 2001 at 09:47:32AM +0100, Markus Kuhn wrote:

> Our department upgraded machines yesterday to the brand new Red Hat 7.1
> release. Here a few impressions I collected while I demonstrated the
> UTF-8 capabilities to my colleagues. UTF-8 locales are available now and
> 
>   % LANG=en_GB.UTF-8 xterm &
> 
> is all that is needed to enter the Unicode world.

Well, there are still various problems, but I agree that the glibc side
is now fully correct.

> The combination of "man" (version 1.5h) and "groff" (GNU troff version
> 1.16.1) is seriously broken in a UTF-8 locale. Even for ASCII only web
> pages, groff inserts Latin-1 SHY bytes, which result in an ugly
> malformed UTF-8 sequence. It is very disappointing that this doesn't
> work correctly out-of-the-box, because the underlying groff mechanics
> for UTF-8 output is already in place and seems to work correctly:

The problem is primarily that the sources of the man pages are not in UTF-8.
That is, the man page viewers have to be modified so that they can
convert encodings.
groff and the source files can be improved so that the encoding in use is
detected and formatted correctly, but that alone won't display it correctly:
if the source is in EUC-JP, it won't show properly in a UTF-8 locale; and if
the source is in UTF-8, it won't show properly if the user uses EUC-JP.

That is a client side problem.
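
The conversion step argued for above can be sketched roughly as follows.
This is only a minimal illustration in Python, not a patch to man: the
function name transcode_page is made up for the example, and it assumes the
viewer already knows the encoding of the page source, which it would have to
determine somehow.

```python
def transcode_page(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert man page bytes from the source encoding to the user's encoding.

    Hypothetical helper: a real viewer would also need a way to find out
    source_encoding, which is assumed to be known here.
    """
    text = raw.decode(source_encoding)
    # Characters the target encoding cannot represent become '?'.
    return text.encode(target_encoding, errors="replace")


if __name__ == "__main__":
    koi8_page = "Привет".encode("koi8-r")
    utf8_page = transcode_page(koi8_page, "koi8-r", "utf-8")
    print(utf8_page.decode("utf-8"))
```

The same call with the encodings swapped covers the opposite case (UTF-8
source, EUC-JP or KOI8-R user), which is why the conversion belongs in the
viewer rather than in the page source.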

>   zcat /usr/share/man/man7/groff_char.7.gz | groff -mandoc -Tutf8 - | less
> 
> produces the desired results, whereas
> 
>   man groff_char
> 
> does not.

man has to get patched.

But does using groff -Tutf8 on a non-UTF-8 file convert it to UTF-8?
In other words, does the -T parameter give the encoding of the input file
or the encoding to use for the output?

> The required fix here is that groff should get a new output device
> -Tplaintext which specifies plaintext encoded according to the current
> locale (just query nl_langinfo(CODESET) and see whether it says "UTF-8"
> or "ISO-8859-*" or something like that). Then in /etc/man.config, we
> could simply replace
> 
>   NROFF           /usr/bin/groff -Tlatin1 -mandoc
> 
> with
> 
>   NROFF           /usr/bin/groff -Tplaintext -mandoc
> 
> and man would automatically work properly in both ISO-8859 and UTF-8
> locales.

Have you tested that idea with Russian or Japanese man pages?
E.g. man pages in KOI8-R displayed under a UTF-8 locale.

I'm afraid your solution will work only for plain ascii pages.
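
To make the concern concrete, here is a rough sketch of what a device chosen
from the locale amounts to. It is an assumption-laden illustration, not how
groff or man work: groff_device_for and current_codeset are hypothetical
names, and the only device names used are the existing utf8, latin1 and
ascii ones. Note that a codeset such as KOI8-R has nothing to map to, and
that nothing in this selection converts the page source.

```python
import locale


def groff_device_for(codeset: str) -> str:
    """Map a locale codeset name to a groff output device (hypothetical helper)."""
    name = codeset.upper().replace("_", "-")
    if name in ("UTF-8", "UTF8"):
        return "utf8"
    if name in ("ISO-8859-1", "ISO8859-1"):
        return "latin1"
    # No matching device (e.g. KOI8-R, EUC-JP): fall back to plain ASCII.
    return "ascii"


def current_codeset() -> str:
    """Ask the C library for the codeset of the current locale (Unix only)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale settings: keep the default "C" locale
    return locale.nl_langinfo(locale.CODESET)
```

A caller would pass groff_device_for(current_codeset()) as the -T argument;
the page source still has to be converted separately.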

> "less" (less 358+iso247) is also still broken and completely messes up
> in UTF-8 mode the handling of backspace boldification used by nroff.
> This still distorts the output of any man page. Test case:
> 
>   perl -e 'use utf8; print "a\ba_\bb\n"' | less
> 
> correctly shows a bold "a" and an underlined "b", but
> 
>   perl -e 'use utf8; print "\x{20ac}\b\x{20ac}_\b\x{2203}\n"' | less
> 
> fails to show either a bold euro sign or an underlined there-exists sign.
> (Perl 5.6 or newer required here)

Here the bold euro shows correctly, but the underlining doesn't.
I think the problem is that '_' is an ASCII char (width=1) while
U+2203 is 3 bytes long in UTF-8.

perl -e 'use utf8; print "\x{20ac}\b\x{20ac}\x{2203}\b___\n"' | less

works.

Tested with:

test:~$ rpm -q perl less
perl-5.600-30mdk
less-358-8mdk
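
The byte counts involved, and what backspace handling looks like when it is
done on decoded characters rather than on bytes, can be sketched like this.
It is a minimal Python illustration under my own assumptions, not the code
in less or in the patch; parse_overstrike is a made-up name.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def parse_overstrike(text: str):
    """Interpret nroff-style 'X BS X' (bold) and '_ BS X' (underline) sequences.

    Works on decoded characters, so a multibyte character counts as one unit.
    Returns a list of (character, style) pairs.
    """
    out = []
    i = 0
    while i < len(text):
        if i + 2 < len(text) and text[i + 1] == "\b":
            first, second = text[i], text[i + 2]
            if first == second:
                out.append((first, "bold"))
                i += 3
                continue
            if first == "_":
                out.append((second, "underline"))
                i += 3
                continue
            if second == "_":
                out.append((first, "underline"))
                i += 3
                continue
        out.append((text[i], "plain"))
        i += 1
    return out


if __name__ == "__main__":
    print(utf8_length("_"), utf8_length("\u2203"), utf8_length("\u20ac"))
    print(parse_overstrike("\u20ac\b\u20ac_\b\u2203\n"))
```

A viewer that pairs one byte before the backspace with one byte after it
cannot match '_' against the three bytes of U+2203, whereas pairing decoded
characters treats both test cases the same way.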

> UTF-8 locale support under X11 (XFree86 4.0.3) also seems still *very*
> broken. For example, I would have hoped that
> 
>   perl -e 'use utf8; print "\x{20ac}"' | xmessage -file -
> 
> (all under LANG=en_GB.UTF-8) shows me a window with the euro sign, but
> what I get instead is display of "â\202¬". :-(

That is in fact a fontset problem.
The sad thing is that it *used to work*: at some point in 4.0.2 or 4.0.3
I remember being very positively impressed that I could launch
programs and have them display Unicode just by editing
/usr/X11R6/lib/X11/locale/en_US.UTF-8/XLC_LOCALE and adding several fontset
definitions.

But it doesn't work anymore.
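
Whichever layer is at fault, the "â\202¬" quoted above is what the three
UTF-8 bytes of the euro sign look like when each byte is shown as a separate
Latin-1 character. A small Python sketch (the helper names are made up for
the illustration):

```python
def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then wrongly decode each byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


def c_style_escape(text: str) -> str:
    """Keep printable characters; show the rest as three-digit octal escapes."""
    return "".join(
        ch if ch.isprintable() else "\\%03o" % ord(ch)
        for ch in text
    )


if __name__ == "__main__":
    euro = "\u20ac"
    print(euro.encode("utf-8").hex())               # the raw UTF-8 bytes
    print(c_style_escape(misread_as_latin1(euro)))  # how they look as Latin-1
```

The middle byte, 0x82, is a control character in Latin-1, which is why it
appears as an octal escape rather than as a glyph.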

> I also tried vi quickly (VIM 6.0z ALPHA) with LANG=en_GB.UTF-8, but when
> I used "vi UTF-8-demo.txt", I just got garbled text on the screen. man
> vi did not contain the search string "uni" or "utf". Couldn't figure out
> whether the vim 6.0z that comes with RH 7.1 has any UTF-8 support. It
> certainly didn't work out-of-the-box.

I tested with:

test:~$ rpm -q vim-enhanced
vim-enhanced-6.0-0.12mdk

and it worked correctly.

> Summary: Red Hat 7.1 is not even suited to make a 5 min demonstration of
> its UTF-8 locale support without serious embarrassment. xterm is pretty
> much the only UTF-8 application that works at the moment.

No, there are several other applications. The most annoying thing is the
broken fontset support at the XFree86 level; if that were fixed, all programs
using fontsets (e.g. all of Gnome, Windowmaker, etc.) would automatically
start displaying nice Unicode out of the box.

There is also another bug, in 'ls'.
Try a 'touch somefile' with a UTF-8 name, then do an ls; you will see
only '?'.
It works with 1-byte encodings (koi8, iso8859, etc.), but not with multibyte
ones (utf-*, euc, etc.).
I wonder if that isn't a bug in isprint() or something like that.
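
The isprint() guess can be modelled as below. This only illustrates the
hypothesis; I have not checked that ls actually does this, and both function
names are invented for the sketch. The first filter tests every byte on its
own, the way a per-byte isprint() would; the second decodes the name first
and tests whole characters.

```python
def bytewise_filter(raw: bytes, encoding: str) -> str:
    """Replace every byte that is not printable on its own with '?'."""
    out = []
    for value in raw:
        try:
            ch = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            ch = ""  # the byte is not a complete character by itself
        out.append(ch if ch and ch.isprintable() else "?")
    return "".join(out)


def charwise_filter(raw: bytes, encoding: str) -> str:
    """Decode the whole name first, then replace unprintable characters."""
    text = raw.decode(encoding, errors="replace")
    return "".join(ch if ch.isprintable() else "?" for ch in text)


if __name__ == "__main__":
    name = "файл"
    print(bytewise_filter(name.encode("koi8-r"), "koi8-r"))
    print(bytewise_filter(name.encode("utf-8"), "utf-8"))
    print(charwise_filter(name.encode("utf-8"), "utf-8"))
```

In a 1-byte encoding every byte is a complete character, so the per-byte
test passes; in UTF-8 no byte of a multibyte character is printable by
itself, which would give exactly the row of '?' seen here.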

> Required action:
> 
> - fix less backspace bug

The mdk version includes a UTF-8 patch; I improved it so that underlining
works OK too.

Is it OK to post patches to this mailing list?


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/		PGP Key available, key ID: 0x8F0E4975

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/