Re: Character set tagging considered harmful
On Fri, 17 Sep 1999, Bram Moolenaar wrote:
> Markus Kuhn wrote:
>
> > Bram Moolenaar wrote on 1999-09-17 12:08 UTC:
> > > Has anyone worked on a method to specify the file encoding with the file?
> >
> > There are several approaches for this. They all failed badly and
> > continue to be part of the problem rather than bringing us in any way
> > closer to the solution.
>
> Thanks for the overview! This is very useful.
>
> > A) ISO 2022 = ECMA-35
>
> I agree this stinks.
>
> > B) The Byte Order Mark (BOM)
> > The Unicode UCS-2 crowd couldn't agree on whether they should use
> > bigendian or littleendian. So they defined U+FFFE to be
> > an illegal character and U+FEFF a zero-width no-break
> > space. This way, a file starting with FE FF smells like
> > bigendian UCS-2 and FF FE smells like littleendian. If you
> > convert either file to UTF-8, it will start with
> > EF BB BF (see Annex F of ISO 10646-1 on
> > <file:/homes/mgk25/public_html/ucs/ISO-10646-UTF-8.html>).
> > The Windows NT notepad seems to contain a (broken) autodetection
> > mechanism based on the BOM idea. It is not common practice
> > to use BOMs on POSIX systems.
>
> This zero-width non-breaking space is what I was thinking about.
> Vim could use this, I suppose:
>
> file starts with    encoding
> FF FE               Unicode little endian
> FE FF               Unicode big endian
> EF BB BF            UTF-8
>
> This could be quite reliable. Only strange binary files could be wrong, and
> the encoding probably doesn't matter there. (I'm not sure Vim will support
> Unicode soon, if at all, thus only the UTF-8 one would matter).
Please make that read:
FF FE               UCS-2 little endian
FE FF               UCS-2 big endian
And don't forget:
FF FE 00 00         UCS-4 little endian
00 00 FE FF         UCS-4 big endian
Unicode != UCS-2.
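To make the table above concrete, here is a minimal sketch of such a check in Python. The name `sniff_bom` and the label strings are made up for illustration; this is not how Vim or any other tool actually does it. Note that FF FE is a prefix of FF FE 00 00, so the four-byte signatures have to be tested first, and a UCS-2 little endian file whose first character after the BOM is U+0000 would still be mistaken for UCS-4.

```python
# Hypothetical BOM sniffer; names and labels are illustrative only.
# Longest signatures come first because FF FE is a prefix of FF FE 00 00.
BOM_TABLE = [
    (b"\xff\xfe\x00\x00", "UCS-4 little endian"),
    (b"\x00\x00\xfe\xff", "UCS-4 big endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UCS-2 little endian"),
    (b"\xfe\xff", "UCS-2 big endian"),
]


def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None."""
    for signature, label in BOM_TABLE:
        if data.startswith(signature):
            return label
    # No signature: the file could be anything (ASCII, Latin-1,
    # BOM-less UTF-8, binary). The BOM check says nothing here.
    return None
```

A caller would pass in the first four bytes of the file. A result of None means only that no signature was present, which is the normal case on POSIX systems.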
> > D) SGML
> > A document declaration can contain a description of the
> > document encoding in a horrendously bizarre way. It was never
> > widely used, even though nsgmls seems to implement it correctly.
> > SGML character set declarations are so bizarre that the XML
> > people gave up and hardwired it to be always UTF-8.
>
> Only works for SGML files too. Same for HTML and probably a few other file
> types. Not useful as a generic solution.
Exactly. There are no good generic solutions.
> > If however all my files are in UTF-8, then without any changes to "grep"
> > I can do a "grep pattern *", and I will get the lines from all
> > specified files that contain the pattern displayed correctly. None of
> > the approaches above can do this. They require a lot of work and are
> > still less functional.
>
> Yes, if you are in paradise everything is perfect. We already know you aim
> for a UTF8-only solution. And you probably understand by now that I prepare
> for an environment of mixed encodings (perhaps you would call it hell :-).
We already have an environment of mixed encodings. It is very messy. The
whole idea of UTF-8 support is to move to one encoding.
We aren't going to force anybody to use it, we just want the option to
exist. Eventually everyone will use it, as it just makes much more sense
to have one universal encoding.
> I agree that the automatic mechanisms must work reliably and predictably.
> When you have something that works only 95% of the time, you get frustrated,
> switch off the automatics and do it by hand. But when the mechanism is
> reliable, it's a big plus.
But it's not possible to build such a reliable mechanism. If we build
unreliable mechanisms, then no-one will use them, and people will resent
the UTF-8 crowd as "the people who made a whole bunch of tools do stupid
stuff for their own purposes, when it was working fine before".
We just want a toggle, between Mess and UTF-8.
Sure, some apps, like mail readers, newsreaders and web browsers, and
arguably editors, need to know about character encodings other
than the one the user is using, but 99% of programs ought to just look at
LC_CTYPE and obey.
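As a rough illustration of "look at LC_CTYPE and obey", the sketch below works out which codeset the environment asks for, using the usual POSIX precedence of LC_ALL, then LC_CTYPE, then LANG. The function name `codeset_from_env` is invented for this example, and it assumes locale names of the form language[_territory][.codeset][@modifier]. A real program would not parse the variables itself; in C it would call setlocale(LC_CTYPE, "") and then nl_langinfo(CODESET).

```python
import os


def codeset_from_env(env=None):
    """Return the codeset requested by the locale variables, or None.

    Illustrative only: real programs should let setlocale() and
    nl_langinfo(CODESET) do this work.
    """
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # An empty value counts as unset.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            break
    else:
        return None
    # Assumed form: language[_territory][.codeset][@modifier]
    value = value.split("@", 1)[0]
    if "." in value:
        return value.split(".", 1)[1]
    # Names such as "C" or "de_DE" carry no explicit codeset.
    return None
```

With LC_CTYPE=de_DE.UTF-8 this gives "UTF-8", and the program then treats all its text input and output as UTF-8 without trying to guess anything per file.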
We aren't going to solve multiple-character-set problems, we're just here
to make everything use a single character set, which is a much neater
solution.
--
Robert
The ASCII Consortium : dragging character encoding kicking and screaming
into the 20th century! <http://www.ecs.soton.ac.uk/~rwb197/ascii/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/