Character set tagging considered harmful
Bram Moolenaar wrote on 1999-09-17 12:08 UTC:
> Has anyone worked on a method to specify the file encoding with the file?
There are several approaches to this. They have all failed badly and
remain part of the problem rather than bringing us any closer to
the solution.
A) ISO 2022 = ECMA-35
ftp://ftp.ecma.ch/ECMA-ST/E035-PDF.PDF
This uses ESC sequences to announce the character encoding.
The ESC sequences are registered internationally on
http://www.itscj.ipsj.or.jp/ISO-IR/
The ESC sequence that announces UTF-8 is ESC % G. In my
opinion, it should only be used to switch remote terminal
emulators into a UTF-8 mode. It is not common practice on
POSIX systems to embed ISO 2022 sequences into plain-text
files, file names, environment variables, pipes, etc.
Japanese people tried and failed. They now use EUC instead
as their single system-wide encoding. The MULE folks do some
form of ISO 2022 support, as does the X11 compound text
selection mechanism. In both camps, there are tendencies
to dump it and go for pure Unicode instead.
B) The Byte Order Mark (BOM)
The Unicode UCS-2 crowd couldn't agree on whether they should use
big-endian or little-endian byte order. So they defined U+FFFE to be
an illegal character and U+FEFF a zero-width no-break
space. This way, a file starting with FE FF smells like
big-endian UCS-2 and FF FE smells like little-endian. If you
convert either file to UTF-8, it will start with
EF BB BF (see Annex F of ISO 10646-1 on
<file:/homes/mgk25/public_html/ucs/ISO-10646-UTF-8.html>).
The Windows NT notepad seems to contain a (broken) autodetection
mechanism based on the BOM idea. It is not common practice
to use BOMs on POSIX systems.
C) MIME
Used in applications where something resembling an RFC822
header starts the file. Widely used in web and mail archives
on POSIX systems today.
D) SGML
A document declaration can contain a description of the
document encoding in a horrendously bizarre way. It was never
widely used, even though nsgmls seems to implement it correctly.
SGML character set declarations are so bizarre that the XML
people gave up and hardwired it to be always UTF-8.
I don't think that I am alone in the perception that all these
approaches lead in exactly the opposite direction from where we
want to head.
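The byte sequences described under B) can be checked directly. The
following is only a minimal sketch in Python: the function names
bom_bytes and sniff_bom are made up for illustration, and Python's
"utf-16-be" and "utf-16-le" codecs stand in for big-endian and
little-endian UCS-2, which they match for characters in the Basic
Multilingual Plane.

```python
# U+FEFF is the character used as byte order mark (BOM).
BOM = "\ufeff"


def bom_bytes(encoding):
    """Return the bytes that result from encoding U+FEFF."""
    return BOM.encode(encoding)


def sniff_bom(data):
    """Guess an encoding from a leading BOM, or return None.

    Illustrative only: a real detector would also have to consider
    other encodings whose BOM begins with the same bytes.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    return None


for name in ("utf-16-be", "utf-16-le", "utf-8"):
    print(name, bom_bytes(name).hex(" "))
```

A file without any of these prefixes yields None here, which is the
normal case on a POSIX system.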
I convert every file I receive until I can read it. If I just want to
read a file without modifying it, then I make a temporary copy that I
convert, display, and discard immediately after I am done. Unix pipes
are a very convenient way of making temporary copies that do not have to
be saved in a new file.
Something like
$ recode cp437..utf-8 < dos-file.txt | less
is a good way of reading a MS-DOS file under Linux. No need to pollute
all my tools with knowledge about legacy encodings.
Note that tagging every file with its character set is exactly as much
effort as converting every file to UTF-8. You are really no closer to
the solution after you have tagged everything, because you still have
to add a mechanism to every application to understand the tag. This is
orders of magnitude more work than, say, just adding UTF-8 support.
If, however, all my files are in UTF-8, then I can run "grep pattern *"
with an unmodified "grep", and the lines from all specified files that
contain the pattern will be displayed correctly. None of
the approaches above can do this. They require a lot of work and are
still less functional.
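Why an encoding-unaware search works on UTF-8 can be sketched as
follows. In UTF-8, the bytes that start a character and the bytes that
continue one come from disjoint ranges, so a byte-wise match of a
complete encoded pattern always begins on a character boundary. The
function name grep_utf8 and the sample text are made up for this
illustration, and the sketch ignores Unicode normalization.

```python
def grep_utf8(pattern, data):
    """Return the lines of UTF-8 encoded data that contain pattern.

    The search is done on raw bytes, the way an encoding-unaware tool
    would do it; decoding happens only to present the result.
    """
    needle = pattern.encode("utf-8")
    return [
        line.decode("utf-8")
        for line in data.split(b"\n")
        if needle in line
    ]


sample = "naïve café\nplain tea\ncafé au lait\n".encode("utf-8")
for hit in grep_utf8("café", sample):
    print(hit)
```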
By the way, if you currently have only ASCII files on your system, then
you have already fully migrated to UTF-8. Congratulations! Don't think
that getting ISO 8859-1 support was as easy as stripping out commands
that nick the parity bit. There are many more things involved. For
instance, the fact that "bash" in its default configuration interprets
'A'+128 as Meta-A, i.e. an emacs-style editor control command, means
that in real life almost nobody under Linux uses any 8-bit filenames.
99% of the publicly available tar files contain only ASCII files. They
are already fully UTF-8 compliant.
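The claim that ASCII files are already UTF-8 can be checked
mechanically: every byte below 0x80 encodes the same character in
both. A small sketch, with the helper names is_pure_ascii and
is_valid_utf8 and the sample contents invented for the example:

```python
def is_pure_ascii(data):
    """True if every byte is in the 7-bit range 0x00..0x7F."""
    return all(byte < 0x80 for byte in data)


def is_valid_utf8(data):
    """True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ascii_file = b"plain old README\n"
latin1_file = "caf\u00e9\n".encode("iso-8859-1")  # contains byte 0xE9

print(is_pure_ascii(ascii_file), is_valid_utf8(ascii_file))
print(is_pure_ascii(latin1_file), is_valid_utf8(latin1_file))
```

The ISO 8859-1 sample fails both checks, which is the kind of file
that actually needs converting.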
A side remark:
I recognize a fundamental philosophical difference between our views:
You apparently like software to be smarter than the potentially ignorant
user. You like software to hide from the user underlying technical
problems. I like software to be simple, easy to understand and
predictable at all levels by a moderately experienced user. If there are
problems, I want to get involved to make sure that they will not
reoccur. I like underlying problems to be solved directly and not
covered by software that tries to be smart. Software that tries to be
smarter than me usually fails badly. I associate attempts to engineer
smart software for ignorant users more with the Microsoft tradition,
while simple and robust concepts are more deeply rooted in the Unix
culture.
For instance, I don't like vim autodetecting CRLF conventions. If I open
a file and I see lots of ^M line endings, I understand immediately that
this file was accidentally not converted correctly when it was
transferred. I enter ":%s/^M//g" (it would be nice to have a shortcut
for this frequent substitution) and the problem is solved. I am in
control and pretty much no bad things happen with this way of using my
computer. With vim's autodetection, I don't notice that I have wrongly coded text files,
until they cause problems later elsewhere (e.g., if I accidentally
included CRLF MS-DOS files into a tar distribution, I look like a stupid
beginner to whoever downloads this file).
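A check of this kind can also be made explicit before packaging
files. The following sketch reports CRLF line endings instead of
hiding them, and offers the same clean-up as the substitution above
(removing every CR byte). The function names count_crlf and strip_cr
and the sample bytes are assumptions for illustration only.

```python
def count_crlf(data):
    """Count CR LF pairs, i.e. MS-DOS style line endings."""
    return data.count(b"\r\n")


def strip_cr(data):
    """Remove every CR byte, like ":%s/^M//g" does in the editor."""
    return data.replace(b"\r", b"")


sample = b"first line\r\nsecond line\r\nthird line\n"
print(count_crlf(sample), "CRLF line endings found")
print(strip_cr(sample))
```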
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/