
Re: Character set tagging considered harmful




Markus Kuhn wrote:

> Bram Moolenaar wrote on 1999-09-17 12:08 UTC:
> > Has anyone worked on a method to specify the file encoding with the file?
> 
> There are several approaches for this. They all failed badly and
> continue to be part of the problem rather than bringing us in any way
> closer to the solution.

Thanks for the overview!  This is very useful.

>   A) ISO 2022 = ECMA-35 

I agree this stinks.

>   B) The Byte Order Mark (BOM)
>      The Unicode UCS-2 crowd couldn't agree on whether they should use
>      bigendian or littleendian. So they defined U+FFFE to be
>      an illegal character and U+FEFF a zero-width non-breaking
>      space. This way, a file starting with FE FF smells like
>      bigendian UCS-2 and FF FE smells like littleendian. If you
>      convert either file to UTF-8, it will start with
>      EF BB BF (see Annex F of ISO 10646-1 on
>      <file:/homes/mgk25/public_html/ucs/ISO-10646-UTF-8.html>).
>      The Windows NT notepad seems to contain a (broken) autodetection
>      mechanism based on the BOM idea. It is not common practice
>      to use BOMs on POSIX systems.

This zero-width non-breaking space is what I was thinking about.
Vim could use this, I suppose:

	file starts with	encoding
	FF FE			Unicode little endian
	FE FF			Unicode big endian
	EF BB BF		UTF-8

This could be quite reliable.  Only unusual binary files could be misdetected,
and the encoding probably doesn't matter there.  (I'm not sure Vim will
support 16-bit Unicode soon, if at all, thus only the UTF-8 one would matter.)
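
A rough sketch of the table above, in Python rather than in Vim's own C
code.  The function names, the codec names and the "latin-1" fallback are
assumptions made for illustration only; nothing here is an existing Vim
feature.

```python
import codecs

# BOM byte sequences from the table above, mapped to Python codec names.
# These names are Python's, not values of Vim's 'fileencoding' option.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data, default="latin-1"):
    """Return (encoding, bom_length) based on the first bytes of a file.

    `default` is a hypothetical fallback for files without a marker.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_with_bom(data, default="latin-1"):
    """Decode file contents, skipping the marker if one was found."""
    encoding, skip = detect_bom(data, default)
    return data[skip:].decode(encoding)
```

A file without any of the three markers falls through to the default, so
this only helps if UTF-8 files really do carry the marker.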

>   C) MIME
>      Used in applications where something resembling an RFC822
>      header starts the file. Widely used in web and mail archives
>      on POSIX systems today.

This requires something "outside" of the file, or a header at the top of the
file.  Not good as a generic solution for different types of files.

>   D) SGML
>      A document declaration can contain a description of the
>      document encoding in a horrendously bizarre way. It was never
>      widely used, even though nsgmls seems to implement it correctly.
>      SGML character set declarations are so bizarre that the XML
>      people gave up and hardwired it to be always UTF-8.

Only works for SGML files too.  Same for HTML and probably a few other file
types.  Not useful as a generic solution.

> I don't think that I am alone with the perception that all these approaches
> are exactly the opposite direction from where we want to head.
> 
> I convert every file I receive until I can read it. If I just want to
> read a file without modifying it, then I make a temporary copy that I
> convert, display, and discard immediately after I am done. Unix pipes
> are a very convenient way of making temporary copies that do not have to
> be saved in a new file.

Ah, but how do you know you need to convert it or not?  OK, you can look at
the file and guess the encoding.  But an automatic mechanism is what I'm
looking for.  At least for UTF8, because that can still be defined
(hopefully).

> Something like
> 
>   $ recode cp437..utf-8 < dos-file.txt | less
> 
> is a good way of reading a MS-DOS file under Linux. No need to pollute
> all my tools with knowledge about legacy encodings.

I'd rather pollute tools with knowledge than my own brain...

> Note that tagging every file with its character set is exactly as much
> effort as converting every file to UTF-8. You are really no closer to
> the solution after you tagged everything, because you still have to add
> a mechanism to every application to understand the tag. This is orders
> of magnitude more work than say just adding UTF-8 support.

I don't agree.  If all (or most) UTF8 encoded files contain a marker for the
encoding, I can make Vim open files in "normal" ASCII encoding by default,
and automatically set the 'fileencoding' option to "UTF8" when it is detected.
That's simple.  Leaving it up to the user to set 'fileencoding' is bothersome.
This does require that all UTF8 files include that marker, of course.

> If however all my files are in UTF-8, then I can, without any changes
> to "grep", run "grep pattern *" and I will get the lines from all
> specified files that contain the pattern displayed correctly. None of
> the approaches above can do this. They require a lot of work and are
> still less functional.

Yes, if you are in paradise everything is perfect.  We already know you aim
for a UTF8-only solution.  And you probably understand by now that I prepare
for an environment of mixed encodings (perhaps you would call it hell :-).

By the way, grep should recognize file types to do its work properly (e.g.,
recognize an executable to avoid messing up my screen).  But that's not
relevant here.

> By the way, if you have currently only ASCII files on your system, then
> you have already fully migrated to UTF-8. Congratulations!

Forget it.  I have a lot of files from Germans which contain umlauts.  And
documentation files, executables with text, etc.  I have no idea how these
show up when they are assumed to be UTF8 files.
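
As a rough illustration of what happens to such files, here is a small
Python sketch.  The helper names are made up for this example.  A Latin-1
umlaut is a single byte above 0x7F, and on its own that is usually not a
valid UTF-8 sequence, so a strict decoder rejects it and a lenient one shows
a replacement character.  This is not a guarantee: some Latin-1 byte pairs
happen to form valid UTF-8.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def show_as_utf8(data):
    """Decode as UTF-8, replacing invalid bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


ascii_bytes = b"plain ASCII text"
latin1_bytes = "f\u00fcr".encode("latin-1")   # "für" as the bytes 66 FC 72
utf8_bytes = "f\u00fcr".encode("utf-8")       # "für" as the bytes 66 C3 BC 72

print(is_valid_utf8(ascii_bytes))    # pure ASCII is also valid UTF-8
print(is_valid_utf8(latin1_bytes))   # the lone FC byte is not
print(is_valid_utf8(utf8_bytes))
print(repr(show_as_utf8(latin1_bytes)))
```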

> Don't think that getting ISO 8859-1 support was as easy as stripping out
> commands that nick the parity bit. There are many more things involved. For
> instance, the fact that "bash" in its default configuration interprets
> 'A'+128 as Meta-A, i.e. an emacs-style editor control command, means that in
> real life almost nobody under Linux uses any 8-bit filenames.

Well, I can only speak for Vim.  8-bit support wasn't difficult, once I
understood how compilers on different systems deal with unsigned chars.  After
implementing it properly, I never had to fix bugs.
UTF8 support is going to be a lot of work for Vim.  One of the biggest
problems is that several bytes will display as only one character on the
screen.  This affects cursor movement, making changes to the text, caching the
screen contents, etc.  Until now only the Korean kind of characters were
supported, where two-byte characters occupy two screen cells.  Only the
round-off had to be dealt with.  We are still fixing bugs as they are
encountered.  I expect UTF8 to be more complicated.
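
To make the mismatch concrete, here is a small Python sketch that counts
bytes, characters and screen cells separately.  The cell_width() helper is a
simplification assumed for this example (wide East Asian characters take two
cells, combining characters take none); it is not how Vim or any particular
terminal computes widths.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of screen cells for one character."""
    if unicodedata.combining(ch):
        return 0          # combining mark, drawn over the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2          # wide or fullwidth character
    return 1


def measure(text):
    """Return byte, character and cell counts for a string."""
    return {
        "bytes": len(text.encode("utf-8")),
        "chars": len(text),
        "cells": sum(cell_width(ch) for ch in text),
    }


print(measure("a"))         # one byte, one character, one cell
print(measure("\u00fc"))    # u-umlaut: two bytes, one character, one cell
print(measure("\ud55c"))    # a Hangul syllable: three bytes, two cells
```

With the two-byte encodings bytes and cells stay in step; with UTF-8 the
three counts can all differ, which is what cursor movement and the screen
cache have to cope with.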

> 99% of the publicly available tar files contain only ASCII files. They
> are already fully UTF-8 compliant.

All Linux RPMs are UTF8 compliant?  Don't think so...  And have a look at a
Japanese ftp server, lots of encodings to choose from!

> A side remark:
> 
> I recognize a fundamental philosophical difference between our views:
> You apparently like software to be smarter than the potentially ignorant
> user. You like software to hide from the user underlying technical
> problems. I like software to be simple, easy to understand and
> predictable at all levels by a moderately experienced user. If there are
> problems, I want to get involved to make sure that they will not
> reoccur. I like underlying problems to be solved directly and not
> covered by software that tries to be smart. Software that tries to be
> smarter than me usually fails badly. I associate attempts to engineer
> smart software for ignorant users more with the Microsoft tradition,
> while simple and robust concepts are more deeply rooted in the Unix
> culture.

I don't like the association between smart software and failing badly.  Smart
software doesn't fail (at least not more often than "normal" software).
Something else is "dumbed down" software, as someone called it (forgot who
that was).  That you could associate with Microsoft tradition.  It means
software that over-simplifies the problem, works well for 90% of the
situations, but fails in the remaining 10%.  And everybody runs into that 10%
some day.

Software should make work easier for the user.  That has many implications.
Sometimes it means simple, robust solutions, sometimes it means smart
solutions, sometimes both.  There is no easy way out from real life.

> For instance, I don't like vim autodetecting CRLF conventions. If I open
> a file and I see lots of ^M line endings, I understand immediately that
> this file was accidentally not converted correctly when it was
> transferred. I enter ":%s/^M//g" (it would be nice to have a shortcut
> for this frequent substitution) and the problem is solved. I am in
> control and pretty much no bad things happen with this way of using my
> computer. With vim, I don't notice that I have wrongly coded text files,
> until they cause problems later elsewhere (e.g., if I accidentally
> included CRLF MS-DOS files into a tar distribution, I look like a stupid
> beginner to whoever downloads this file).

It seems we disagree here.  The automatic detection of CRLF is a great plus
for most users.  No longer do you need to worry about what file system the
file is located on.  The principle Vim uses is to write out the file exactly
as it was read in.  Only when you want to convert from LF to CRLF or back do
you need to perform some action.  This is 100% reliable.  I don't see a need
to give more control to the user.
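
The principle can be sketched in a few lines of Python.  This is an
illustration under assumptions, not Vim's actual code: the function names
are invented, a file counts as "dos" only when every line ends in CRLF, and
the file is assumed to end with a line terminator.

```python
def detect_fileformat(data):
    """Return "dos" if every line ends in CRLF, otherwise "unix"."""
    pieces = data.split(b"\n")
    body = pieces[:-1]   # the pieces that were followed by a line feed
    if body and all(piece.endswith(b"\r") for piece in body):
        return "dos"
    return "unix"


def read_lines(data):
    """Split file contents into lines, remembering the detected format."""
    fmt = detect_fileformat(data)
    eol = b"\r\n" if fmt == "dos" else b"\n"
    lines = data.split(eol)
    if lines and lines[-1] == b"":
        lines.pop()      # drop the empty piece after the final terminator
    return fmt, lines


def write_lines(fmt, lines):
    """Join lines again using the format that was detected on reading."""
    eol = b"\r\n" if fmt == "dos" else b"\n"
    return b"".join(line + eol for line in lines)
```

Reading and then writing gives back the same bytes for a pure CRLF file and
for a pure LF file.  In a mixed file the stray CR stays part of the line, so
it remains visible and is also written back unchanged.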

Compare this with an editor that only sees CRLF as end-of-line.  When editing
a Unix file it handles it like one long line.  Or with an editor that shows
"^M" at the end of every line.  If you want to keep them, you probably need to
remove all of them first, do your changes, and then put them back again before
you write.  When making further changes, you need to remove them again, etc.

If you think CRLF files are "wrongly encoded", you probably are in paradise
again! :-)  In the real world we just need to work with these files.
Considering reactions I get from users of Vim versus Vi, the way Vim works is
the best solution.  It appears you are one of the few exceptions (or you just
hate MS).

I agree that the automatic mechanisms must work reliably and predictably.
When you have something that works only 95% of the time, you get frustrated,
switch off the automatics and do it by hand.  But when the mechanism is
reliable, it's a big plus.

--
hundred-and-one symptoms of being an internet addict:
96. On Super Bowl Sunday, you followed the score by going to the
    Yahoo main page instead of turning on the TV.

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/