Re: Character set tagging considered harmful
Markus Kuhn wrote:
> The UCS-2 range is the maximum in use today. There are no characters yet
> defined outside the range U+0000 to U+FFFD, which is known as "Plane 0"
[...]
> (again, also character not used on computers today). So it is good to be
> prepared for more than UCS-2.
>
> UTF-16 is an extension of UCS-2 that uses a pair of 16-bit characters
> from a high and low surrogate area in UCS-2 to represent characters in
> planes 1 to 16 (U+010000 to U+10FFFF). UTF-16 can cover a bit over 1
[...]
> the future. A code range of 1 million is commonly considered to be more
> than good enough. Plenty of room for contact with
> extraterrestrials ... ;-)
OK, so I should prepare to store characters 0-FFFF for sure, and 0-10FFFF to
be on the safe side. That's about two and a half bytes (21 bits).
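
As a rough check of the sizes discussed above, here is a minimal Python sketch. The helper name `to_surrogates` is made up for illustration and is not from any library. It shows how many bits the largest code point needs and how a code point in planes 1 to 16 splits into a UTF-16 surrogate pair.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in planes 1 to 16")
    offset = cp - 0x10000             # fits in 20 bits
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


# Number of bits needed to store any value in 0..10FFFF.
bits_needed = (0x10FFFF).bit_length()
```

Here `bits_needed` is 21, so three bytes per character are enough for the whole range, and two bytes are enough only if surrogate pairs are handled as one character.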
> > I need to reserve space for each character, thus I
> > would like to know if 4 bytes is enough.
>
> 4 bytes per character is *more* than enough per character. UCS is just a
> 31-bit character set after all, so a signed 32-bit int (that is what
> glibc's wchar_t is) will more than do. Even 3 bytes will last forever
> and 2-bytes would be OK so far if you are prepared to handle pairs of
> UTF-16 surrogate values as single characters.
I suppose those characters still can occupy one screen cell (in a fixed-width
font)? What I need is to reserve space for each screen cell. Hmm, I notice
that it's possible to have two or three characters occupy one screen cell
(with combining characters). I'll probably not support combining characters
at first, thus sticking to level 1.
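
To illustrate what supporting combining characters would mean for the screen, here is a small sketch that groups a string into cells, attaching each combining character to the base character before it. The function name `split_into_cells` is hypothetical, and the sketch makes simplifying assumptions: it ignores double-width characters and treats every character with a non-zero canonical combining class as combining.

```python
import unicodedata


def split_into_cells(text):
    """Group characters so that combining marks share a cell with their base."""
    cells = []
    for ch in text:
        if cells and unicodedata.combining(ch) != 0:
            cells[-1] += ch      # combining mark: append to the previous cell
        else:
            cells.append(ch)     # base character: start a new cell
    return cells
```

With this grouping, "e" followed by U+0301 (combining acute accent) takes one cell but two characters, which is why a fixed number of bytes per cell stops being enough once combining characters are supported.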
> > The UTF-8 encoding might be longer, of course.
>
> No. Better have another careful look at how UTF-8 really works:
I mean, if you take a two-byte character code, it can be from 0 to FFFF. You
need up to three bytes for it when using UTF-8 encoding.
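
The size relation can be sketched as follows. The function name `utf8_length` is made up for illustration, and the sketch assumes the range is limited to U+10FFFF as discussed above (it does not cover the longer forms needed for a full 31-bit UCS).

```python
def utf8_length(cp):
    """Return the number of bytes UTF-8 uses for code point cp (0..10FFFF)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return 1      # ASCII
    if cp < 0x800:
        return 2      # includes the upper half of Latin-1
    if cp < 0x10000:
        return 3      # rest of the two-byte (UCS-2) range
    if cp <= 0x10FFFF:
        return 4      # planes 1 to 16
    raise ValueError("code point above U+10FFFF")
```

So a character code that fits in two bytes takes one, two or three bytes in UTF-8, depending on its value.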
I don't know yet if I want to store the text UTF-8 encoded or as Unicode
characters. Perhaps both (at different places). It also depends on the need
to store character attributes (color, bold, underline, etc.).
> XFree86 xterm is restricted to the UCS-2 range by the way, as is the X11
> font mechanism.
>
> My advice would be to try and keep UTF-8 as the in-memory encoding. Do
> not convert to a fixed-width encoding unless really necessary for
> table-lookups, etc. The self-synchronizing properties of UTF-8 make this
> very feasible. You can even preserve illegal UTF-8 sequences this way
> such that you lose no information if you load and save a binary file
> accidentally in UTF-8 mode. Mined98 is doing this nicely, as are a
> number of other existing UTF-8 editors. The plan for emacs is also to
> keep UTF-8 as the in-memory representation, in the interest of binary
> transparency.
Sounds like a good idea. Looking through the xterm code, I notice that it
stores each character of the screen in two bytes (excluding color and
attributes). This can't be the UTF-8 encoding; that would take up to three bytes.
Since Vim does screen handling like xterm, I might end up doing the same.
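
The two properties from the quoted advice (self-synchronization and preserving illegal sequences) can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up, and the round trip relies on Python's `surrogateescape` error handler, which is Python's own mechanism and not necessarily how Mined98 or any other editor does it.

```python
def load_preserving_bytes(raw):
    """Decode UTF-8, mapping each illegal byte to a lone surrogate code unit."""
    return raw.decode("utf-8", errors="surrogateescape")


def save_preserving_bytes(text):
    """Encode back to bytes, restoring the original illegal bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def char_start(buf, i):
    """Move i back to the first byte of the UTF-8 character containing buf[i].

    Continuation bytes always look like 10xxxxxx, so no state is needed.
    """
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i
```

Loading and then saving a buffer that contains stray bytes such as 0xFF gives back the same bytes, and `char_start` finds a character boundary from any position by looking at the bytes alone.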
> > Are you saying that it's not possible to detect UTF-8 encoding reliably?
> > Well, that's something that needs to be worked on!
[...]
> I assure you, that UTF-8 files will not be tagged in any special way on
> POSIX systems. Just like ASCII and ISO 646-Swedish files were never
> tagged in any special way. Typed files are simply not the Unix way, for
> very good reasons. There will be no BOM or ESC 2022 announcer, and if
> there is one occasionally, it will either cause trouble or be lost after
> the next cut & paste, grep, tail, conversion, etc. This stuff is not
> robust in general. It might work in special restricted applications, but
> not more. The world is already full of UTF-8 files. Search for UTF-8 on
> dejanews, and you'll hit a hundred thousand postings, because Asian
> versions of Netscape and IE have been sending out UTF-8 files for years.
I suppose there is a mechanism for cut&paste, using the X selection, to
indicate the encoding of the text. I noticed a remark about adding UTF8
somewhere.
For Vim, I could write files that are in UTF8 format with the invisible tag
mentioned in a previous message. At least that will help when opening the
same file on another system with Vim. If this works well, other applications
might start doing the same. This tag should not hurt anyone with a UTF8-only
system. However, if this tag does break something, I won't add it.
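
A rough sketch of how reading such a file could work, next to plain validation for untagged files. Assumptions: the tag is taken to be a U+FEFF signature at the start of the file (the BOM mentioned in the quoted text; the earlier message is not reproduced here), Latin-1 is the only fallback, and the function name `guess_encoding` is made up.

```python
import codecs


def guess_encoding(raw):
    """Guess the encoding of a byte string: tagged UTF-8, UTF-8 or Latin-1."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"       # tagged: this codec strips the signature
    try:
        raw.decode("utf-8")      # strict check: illegal sequences raise
    except UnicodeDecodeError:
        return "latin-1"         # assumed fallback, every byte is valid here
    return "utf-8"               # valid UTF-8 (pure ASCII also lands here)
```

The validation step is only a heuristic: a Latin-1 file that happens to form legal UTF-8 sequences would be misread, although that is unlikely for ordinary text.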
> > Switching to a single encoding is not an option for most people at this
> > time, since many files are Latin-1 encoded.
>
> The files are really not the problem. Files are very easily converted
> without loss of information.
They _can_ be converted, but will it happen? Why do people still use Windows
3.1 or MS-DOS, even though they know it's bad? There often is a reason to
keep using old stuff, even though the new stuff is better.
> The problem is applications that structurally cannot yet deal with files
> that can contain a million different characters. Most applications believe
> that there exist not more than 256 characters. That is the real problem.
Well, as long as there is one relevant application that has this problem,
files can't all be converted as a result. The logical consequence is that
Latin-1 files will be around for a while...
--
hundred-and-one symptoms of being an internet addict:
112. You are amazed that anyone uses a phone without a modem on it...let
alone hear actual voices.
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/