
Re: Character set tagging considered harmful




Robert Brady wrote:

> Please make that read, 
> 
>         FF FE                   UCS-2 little endian
>         FE FF                   UCS-2 big endian
> 
> And don't forget
>         FF FE 00 00             UCS-4 little endian
>         00 00 FE FF             UCS-4 big endian
> 
> Unicode != UCS-2.

Ah, of course.  Thanks for the correction.  I wonder, is UCS-4 the maximum
that is in use today?  I need to reserve space for each character, thus I
would like to know if 4 bytes is enough.  The UTF-8 encoding might be longer,
of course.
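To make the table above and the space question concrete, here is a minimal
sketch in Python.  The names BOMS, sniff_bom and utf8_length are hypothetical,
invented for this illustration.  The sketch assumes the file really starts
with a byte-order mark (a UCS-2 little endian file whose first character is
NUL would be misread as UCS-4), and the length function assumes UTF-8 defined
over the full 31-bit UCS-4 range.

```python
# Hypothetical lookup table: byte-order mark -> encoding name.
# Longer signatures come first, because the UCS-4 little endian mark
# (FF FE 00 00) begins with the UCS-2 little endian mark (FF FE).
BOMS = [
    (b"\xff\xfe\x00\x00", "UCS-4 little endian"),
    (b"\x00\x00\xfe\xff", "UCS-4 big endian"),
    (b"\xff\xfe", "UCS-2 little endian"),
    (b"\xfe\xff", "UCS-2 big endian"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    return None


def utf8_length(code: int) -> int:
    """Bytes needed to store one 31-bit UCS-4 value as UTF-8 (1 to 6)."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("not a 31-bit UCS-4 value")
    # Upper bound (exclusive) of each sequence length, shortest first.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000]
    for length, limit in enumerate(limits, start=1):
        if code < limit:
            return length
    return 6
```

So four bytes per character are enough for a fixed-width UCS-4 buffer, while
a buffer holding the UTF-8 form of an arbitrary 31-bit value would need up to
six bytes under that assumption.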

> We already have an environment of mixed encodings. It is very messy. The
> whole idea of UTF-8 support is to move to one encoding.

I thought the whole idea of UTF-8 was to be compatible with 7-bit ASCII.
Making the switch from the past to the future easy for people is one of the
reasons UTF-8 could become the most popular encoding.
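A small Python illustration of that compatibility, as a sketch only (the
variable names are made up for this example): pure 7-bit ASCII text is
byte-for-byte identical when stored as UTF-8, and a non-ASCII character turns
into bytes that all have the high bit set, so it can never be mistaken for an
ASCII character.

```python
text = "plain 7-bit ASCII text"

# The same bytes come out whether we call the encoding ASCII or UTF-8.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# A non-ASCII character (e with acute accent, U+00E9) becomes two bytes.
encoded = "\u00e9".encode("utf-8")

# Every byte of a multibyte sequence is 0x80 or higher.
all_high = all(byte >= 0x80 for byte in encoded)
```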

> We aren't going to force anybody to use it, we just want the option to
> exist. Eventually everyone will use it, as it just makes much more sense
> to have one universal encoding.

Yes, eventually.  But there will be a time with mixed encodings in between.
We need to deal with that.

> > I agree that the automatic mechanisms must work reliable and predictable.
> > When you have something that works only 95% of the time, you get frustrated,
> > switch off the automatics and do it by hand.  But when the mechanism is
> > reliable, it's a big plus.
> 
> But it's not possible to build such a reliable mechanism. If we build
> unreliable mechanisms, then no-one will use them, and people will dislike
> UTF-8 as "they were the people who made a whole bunch of tools do stupid
> stuff for their own purposes, but it was working fine before".

Are you saying that it's not possible to detect UTF-8 encoding reliably?
Well, that's something that needs to be worked on!
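As a rough sketch of what such a detection could look like (the function
name looks_like_utf8 is hypothetical, and this uses Python's strict decoder
rather than any particular tool's method): accept the data as UTF-8 only if
every byte sequence in it is well formed.  Typical Latin-1 text with accented
letters fails the check, but some Latin-1 byte pairs happen to be valid
UTF-8, so this is a heuristic and not a proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strictly valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with an accented e, stored as Latin-1 and as UTF-8.
latin1_sample = "caf\u00e9".encode("latin-1")
utf8_sample = "caf\u00e9".encode("utf-8")

# Counter-example: the two Latin-1 characters U+00C3 U+00A9 produce
# exactly the bytes that UTF-8 uses for U+00E9.
ambiguous_sample = "\u00c3\u00a9".encode("latin-1")

results = (
    looks_like_utf8(latin1_sample),
    looks_like_utf8(utf8_sample),
    looks_like_utf8(ambiguous_sample),
)
```

Pure ASCII data also passes, which is harmless since it reads the same
either way.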

> We just want a toggle, between Mess and UTF-8.

And we need to help the people that have to toggle all the time.

> Sure, some apps, like mailreaders and newsreaders, and webbrowsers, and
> arguably perhaps editors, need to know about character encodings other
> than the one the user is using, but 99% of programs ought to just look at
> LC_CTYPE and obey.
> 
> We aren't going to solve multiple-characterset problems, we're just here
> to make everything use a single-characterset, which is a much neater
> solution.

I don't know who "we" are here.  For myself, I want to do both: aim for a
single encoding in the end AND make an acceptable path to get there.  If you
just aim for the final solution and forget about how to get there, it is
likely to fail!  We are already in a mixed-encoding environment, and we have
to deal with it.

Switching to a single encoding is not an option for most people at this time,
since many files are Latin-1 encoded.  If we make it easy to use UTF-8 instead
of other encodings, the number of UTF-8 encoded files will grow.  That will
make it possible to switch over completely after some time.

If you really want to get to that single-encoding paradise, you should make
sure that UTF-8 has a big advantage over all other encodings TODAY.  Being able
to reliably detect a UTF-8 encoded file will certainly help.  And it will
certainly not be a disadvantage when you are in a single-encoding environment.

--
hundred-and-one symptoms of being an internet addict:
105. When someone asks you for your address, you tell them your URL.

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/