Re: Unicode and man/groff/less problems
Tomohiro KUBOTA wrote on 2001-03-02 11:43 UTC:
> It is true we have collection packages of translated manpages. (For
> example, Debian 2.2 has German, Spanish, Finnish, French, Hungarian,
> Italian, Japanese, Korean, and Polish collection of manpages.) I think
> you are saying about such collections. However, there are also many
> softwares which includes non-English manpages written by non-English-
> speaking member of developers' group for such softwares and so on,
> just like each package has '.po' files. Thus, it will be a large
> amount of labor to convert them.
Still, we can encourage people to add, alongside the normal "install-man"
target, a Makefile target "install-man-utf8" that passes every man page
through iconv before installing it under /usr/man/.
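As a rough illustration of what such an install step does to each page, here is a minimal Python sketch of the same re-encoding that iconv would perform. The function name to_utf8 is hypothetical, and the source encoding is assumed to be known to the package maintainer; nothing here is taken from an existing package.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode one man page from its legacy encoding to UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in the
    declared source encoding, so a wrong guess is noticed at
    installation time rather than at reading time.
    """
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # A Latin-1 page and an EUC-JP page, both converted once at install time.
    latin1_page = "Größe der Datei".encode("iso8859-1")
    eucjp_page = "日本語のマニュアル".encode("euc_jp")
    print(to_utf8(latin1_page, "iso8859-1").decode("utf-8"))
    print(to_utf8(eucjp_page, "euc_jp").decode("utf-8"))
```

The master copy stays in whatever encoding the author prefers; only the installed copy changes.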
It hurts my ambitions to make the world simpler by convincing everyone
to switch to UTF-8 if the consequence is that the world becomes more
complicated by adding additional character set conversion mechanics into
all levels and layers. This is just yet another example.
> (Do you know the number of open-source softwares in the world?
> Of course I don't know. SourceForge has 16558 projects now but
> it is obvious they are only a part of open source softwares in
> the world.)
Most of these 16558 projects are defunct.
"Don't trust a statistic you
haven't faked yourself."
> At least I wrote three Japanese manpages which are included in
> corresponding software packages and not included in such "Japanese
> manpages collection".
Could you add a "make install-man-utf8" option to that package? It
shouldn't require more than 5-6 lines in the Makefile.
> > - Man page maintainers do not need to use a UTF-8 editor. They can
> > keep things in their traditional encoding and just add to their
> > Makefiles an option to apply iconv at installation time.
>
> You have not explained why your opinion is better though your opinion
> needs such "iconv".
Iconv is only needed once at installation time and also only until
authors start to use UTF-8 editors and keep the master versions of the
man pages in UTF-8.
> My opinion doesn't need such "iconv".
It needs iconv to be called each time by groff forever.
> Yes, thus, implementing a mechanism to read encoding tag for manpage
> reader software is easy.
Easy in the sense of the MIME/ISO 2022 philosophy of tagging everything
under the assumptions
- that there will never be a single global encoding,
- that Emacs is the only editor that you will ever need,
- that Unix's notion and aim of a type free file system was a mistake
to start with.
We have gone through these philosophical differences already a couple of
times. Man is not different here. This time you probably just forgot
about the mess that you will create in secondary tools such as
"whereis", "apropos", "whatis" which now would have to associate
character sets in their indexing mechanisms, etc.
> I don't insist that all manpages must have encoding tag. I feel you
> misunderstand my opinion on this point.
Maybe. I just don't understand your fascination with keeping everything
on the hard disk in different encodings, if the reading application
has to transform the files it reads into UCS internally anyway. Doing the
transform to UTF-8 at man page installation time seems an obvious
optimization and simplification to me.
> In short, because there are many manpages while only few softwares to
> read them, it is easier to modify softwares than manpages.
It's just a single-line iconv call we are talking about.
> I don't understand what is the merit of your opinion. Your way need
> - conversion of collection manpages
> - re-education of manpage writers all over the world
> - sudden (not gradual) migration to UTF-8 because your opinion
> doesn't support manpages written in EUC-JP, ISO8859-1, ISO8859-2,
> KOI8-R, or so on
You would also have to tag all non-ASCII manpages, otherwise how should
groff in a UTF-8 locale know what the input character set is? Why is
tagging cheaper than conversion?
I see this primarily in the context of large commercial distributions
(mostly SuSE and RedHat) with major releases. They just need someone to
go through their SRPM database and scan for non-ASCII man pages (can be
automated to a large extent), and add to the installation scripts in
these SRPMs a few lines with iconv calls to make sure only ASCII or
UTF-8 man pages get installed. Sounds like a 10-20 person-day project to
me for a distribution of the size of SuSE 7.1. Boring but doable. Then
man will be compiled with an option to treat all man pages by default as
UTF-8. Users will be upgraded atomically when they install the next
release of their distribution. Maybe things are a bit more difficult in
the Debian environment, where I assume updates and releases are done more
gradually.
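The scanning step could be automated along the following lines. This is only a sketch under stated assumptions: the names classify and find_legacy_pages are hypothetical, the pages are assumed to be stored uncompressed, and "legacy" simply means "neither pure ASCII nor valid UTF-8". A legacy-encoded page whose bytes happen to form valid UTF-8 would slip through, which is rare but possible.

```python
from pathlib import Path


def classify(raw: bytes) -> str:
    """Return "ascii", "utf-8" or "legacy" for the bytes of one man page."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so it must be in some older 8-bit or
        # multibyte encoding and needs an iconv call at install time.
        return "legacy"
    return "utf-8"


def find_legacy_pages(root):
    """List all files below root that still need conversion, sorted by path."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and classify(path.read_bytes()) == "legacy"
    )


if __name__ == "__main__":
    import sys

    # Usage: python scan.py <directory with unpacked man pages>
    for page in find_legacy_pages(sys.argv[1]):
        print(page)
```

The list it prints is the set of pages for which someone still has to look up the source encoding by hand.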
Once some critical mass is reached, package maintainers will all provide
"make install-man-utf8" and configure can even test whether your man was
configured to read UTF-8 and install the man pages in UTF-8 accordingly.
A man command line option to find out what encoding this binary assumes
for its input files would be a good idea to make that configure test
trivial.
> while all my opinion needs is rewriting groff (of course your opinion
> also needs this). On the other hands, my opinion
> - supports all existing manpages without conversion
> - doesn't need impossible re-education of manpage writers
> - supports any encodings for manpage writers (including UTF-8)
But it still needs tagging of all non-ASCII man pages to allow groff to
produce the correct output if man encoding and locale happen to be
incompatible (for instance with EUC man pages read under a UTF-8
locale). I believe this tagging is *exactly* as much work as converting
to UTF-8!
> It is obvious my opinion is better and can be accepted by real users
> all over the world.
No.
> Your opinion is just a plot or at least a radicalism to kill
> non-UTF-8 encodings right now.
No, it is just almost as much work as your idea if we want to make sure
that man works satisfactorily under UTF-8 locales, whereas your idea will
not guarantee that the originally reported problem will be solved.
> > -Tplaintext Plain text (charset according to locale)
> > -Tsgrtext Plain text with added ISO 6429 SGR (ESC [ ... m) emphasis
> > (charset according to locale)
> > -Tbstext Plain text with added backspace emphasis
> > (bold and underline only, charset according to locale)
>
> I am not interested in 'bs' emphasis mechanism.
Me neither, I just listed it as a way to preserve existing
functionality. I won't cry if it gets kicked out, but please extend
"less" first to handle the SGR sequence correctly.
I like the old notion of "ASCII plaintext" very much, I just believe
that it is time for a very careful face lift of the concept, in order to
ensure that it will keep us happy for the next century. That facelift
includes:
- Replace everywhere ASCII by UTF-8 to get a richer repertoire of
characters and symbols
- Make ESC [ ... m more widely supported for very basic style annotation
in pagers and plain text editors.
Especially use the above two to replace any form of BS/CR overstriking
tricks (which the ISO 8859 standard already explicitly forbade
anyway).
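To make the replacement of overstriking concrete, here is a minimal Python sketch that rewrites the usual backspace conventions ("X BS X" for bold, "_ BS X" for underline) into SGR sequences. The function name overstrike_to_sgr is hypothetical, and the sketch assumes one line at a time, single-character overstrikes only, and no CR overstriking. The ambiguous case "_ BS _" is treated as a bold underscore here.

```python
RESET = "\x1b[m"


def overstrike_to_sgr(line: str) -> str:
    """Rewrite backspace emphasis in one line as SGR 1 (bold) / SGR 4 (underline)."""
    # First pass: decode the line into (character, style) cells.
    cells = []
    i = 0
    while i < len(line):
        if i + 2 < len(line) and line[i + 1] == "\b":
            first, second = line[i], line[i + 2]
            if first == second:
                style = "1"   # same character struck twice: bold
            elif first == "_":
                style = "4"   # underscore under a character: underline
            else:
                style = ""    # unknown overstrike: keep the later character
            cells.append((second, style))
            i += 3
        else:
            cells.append((line[i], ""))
            i += 1

    # Second pass: emit an SGR sequence only where the style changes.
    out = []
    current = ""
    for char, style in cells:
        if style != current:
            if current:
                out.append(RESET)
            if style:
                out.append(f"\x1b[{style}m")
            current = style
        out.append(char)
    if current:
        out.append(RESET)  # never leave emphasis open at the end of the line
    return "".join(out)


if __name__ == "__main__":
    print(repr(overstrike_to_sgr("N\bNA\bAM\bME")))
    print(repr(overstrike_to_sgr("_\bf_\bi_\bl_\be")))
```

Because each styled run is closed before the next one starts and at the end of the line, the output never carries emphasis across a line boundary.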
I was recently in contact with the initiator of project Gutenberg, and
they are interested in updating their plaintext public domain literature
format guidelines to UTF-8 and ISO 6429 SGR as soon as a few more
editors that support comfortable entry are available.
The only problem with SGR over BS is that the former introduces state,
whereas the BS emphasis is almost as stateless as UTF-8. A compromise
would probably be to require that in a plaintext file with SGR
sequences, the last SGR sequence on every line (if there is any) must be
"ESC [ m". This way, style annotation of each line becomes
self-contained and tools such as grep will continue to work reasonably
well. It was my understanding that this is similar to existing
implementation practice with ISO-2022-JP, where also every single line
specifies the encoding, right?
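The proposed line rule can be checked and enforced mechanically. The following is a minimal Python sketch, not an existing tool: the names is_self_contained and make_self_contained are hypothetical, and as a simplification only the bare forms "ESC [ m" and "ESC [ 0 m" are recognized as resets (a combined sequence such as "ESC [ 0;1 m" is treated as ordinary emphasis).

```python
import re

SGR = re.compile(r"\x1b\[([0-9;]*)m")
RESET = "\x1b[m"


def is_self_contained(line: str) -> bool:
    """True if the line has no SGR sequence, or its last one is ESC [ m."""
    sequences = [match.group(0) for match in SGR.finditer(line)]
    return not sequences or sequences[-1] == RESET


def make_self_contained(text: str) -> str:
    """Rewrite text so that every line satisfies is_self_contained().

    Emphasis that was still open at the end of a line is closed there
    and restated at the start of the next line.
    """
    result = []
    active = []  # SGR sequences still in effect when a line begins
    for line in text.split("\n"):
        line = "".join(active) + line
        active = []
        last = None
        for match in SGR.finditer(line):
            last = match.group(0)
            if match.group(1) in ("", "0"):
                active = []
            else:
                active.append(last)
        if last is not None and last != RESET:
            line += RESET
        result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    sample = "\x1b[1mbold starts here\nand continues\x1b[m on this line"
    for fixed_line in make_self_contained(sample).split("\n"):
        print(repr(fixed_line), is_self_contained(fixed_line))
```

After this rewrite, a tool like grep can print any single line and the terminal state is left clean afterwards.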
> I think transliteration should not be done _after_ typesetting.
At least multichar transliteration such as "ä"->"ae" and "ß"->"ss" for
German must be taken into consideration during the line breaking
algorithm, otherwise right adjustment of margins will fail. But if you
do the typesetting on a wchar_t string with wcwidth (and if wcwidth
eventually provides transliteration width information), then you could
leave the actual transliteration to wprintf etc. and you have one
thing less to worry about.
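The interaction between transliteration and line breaking can be shown with a small Python sketch. Everything in it is an assumption for illustration: the TRANSLIT table covers only a few German characters, output_width stands in for a wcwidth that knows about transliteration, and every character is assumed to be one column wide. The line breaker measures words by their width on the output device but returns the text untransliterated, leaving the substitution to the output stage.

```python
# Hypothetical transliteration table for an ASCII-only output device.
TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def transliterate(text: str) -> str:
    """Replace each character that has an ASCII fallback."""
    return "".join(TRANSLIT.get(char, char) for char in text)


def output_width(text: str, ascii_only: bool) -> int:
    """Width of text in columns as it will appear on the output device."""
    return len(transliterate(text)) if ascii_only else len(text)


def break_lines(words, line_width: int, ascii_only: bool):
    """Greedy line breaking that measures words by their output width."""
    lines, current, used = [], [], 0
    for word in words:
        width = output_width(word, ascii_only)
        needed = width if not current else used + 1 + width
        if current and needed > line_width:
            lines.append(" ".join(current))
            current, used = [word], width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(" ".join(current))
    return lines


if __name__ == "__main__":
    words = ["Größe", "der", "Straße"]
    print(break_lines(words, 16, ascii_only=False))
    print(break_lines(words, 16, ascii_only=True))
```

With a 16-column line, the three words fit on one line in a UTF-8 locale but need two lines once "ö" and "ß" each occupy two columns.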
> I agree this is not a permanent solution. However, you agree that
> non-ASCII characters may be lost or displayed wrongly so far, don't you?
Yes.
> I think non-ASCII characters should not be used for important
> description in manpages for many years.
Not before UTF-8 for man pages has been widely deployed for use by
languages that can't at all be represented in ASCII.
> However, if there were some contents which cannot be expressed
> within ASCII and ASCII transliteration would cause fatal
> misunderstanding of manpages, it would be a serious problem.
There are some potential pitfalls with the use of quotation marks as
used in source code, but that is somewhat independent of UTF-8.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/