Markus Kuhn <mgk25@xxxxxxxxx> writes:
Question 1:
> There is a contradiction in the above: A 4-byte UTF-8 word has only
> space for 6*3+3=21 payload bits, so how do you plan to fit 22 bits in
> this?
Oops, sorry, that was my mistake. I meant 5-byte.
> b) Instead of UTF-8, use your own variant (let's call it UTF-E1)
> which uses for example the following 4 multi-byte sequences:
>   0xxxxxxx
>   110xxxxx 10xxxxxx
>   1110xxxx 10xxxxxx 10xxxxxx
>   1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx
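For illustration, here is a minimal Python sketch of an encoder for
the four UTF-E1 forms listed above. The function name utf_e1_encode
and the rule of always picking the shortest form are assumptions made
for this example, not part of the proposal.

```python
def utf_e1_encode(cp):
    """Encode a code point of up to 22 bits using the four UTF-E1 forms.

    Assumption: like UTF-8, always emit the shortest form that fits.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:        # 0xxxxxxx: 7 payload bits
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx: 5 + 6 = 11 bits
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 12 = 16 bits
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x400000:    # 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 + 18 = 22 bits
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point does not fit in 22 bits")
```

Under these assumptions the output is byte-identical to UTF-8 for
every code point below 0x200000; only the extra 22nd bit makes the
lead byte go beyond 0xF7.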
Interesting idea! But I don't think we need to save just
one byte for very rarely used characters.
> But if you really want to deviate from UTF-8, then it is worth
> examining more fully, what properties/tradeoffs of UTF-8
> are actually needed for the new Emacs buffer-multi-byte encoding.
> UTF-8 is ASCII compatible, preserves the UCS-4BE strcmp result
> and is self synchronizing. Is all that needed inside an Emacs
> buffer? Would for example a simpler 21-bit encoding (let's
> call it UTF-E2) without self-synchronization but all the other
> properties such as
>   0xxxxxxx
>   1xxxxxxx 1xxxxxxx 1xxxxxxx
> be better suited (it would require slightly modified
> string-search algorithms though, for instance)?
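To make the trade-off concrete, a small Python sketch of the UTF-E2
layout quoted above. The names utf_e2_encode and utf_e2_decode are
made up for this example, and splitting the 21 bits big-endian into
three 7-bit groups is an assumption (it is what keeps the strcmp
order).

```python
def utf_e2_encode(cp):
    """One byte for ASCII, otherwise three bytes with the high bit set."""
    if not 0 <= cp < 0x200000:
        raise ValueError("code point does not fit in 21 bits")
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    return bytes([0x80 | (cp >> 14),     # 1xxxxxxx: top 7 bits
                  0x80 | ((cp >> 7) & 0x7F),
                  0x80 | (cp & 0x7F)])   # 1xxxxxxx: low 7 bits


def utf_e2_decode(data):
    """Decode from the start of data; the caller must be on a boundary."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            out.append(data[i])
            i += 1
            continue
        group = data[i:i + 3]
        if len(group) < 3 or any(b < 0x80 for b in group):
            raise ValueError("truncated or broken 3-byte group")
        out.append(((group[0] & 0x7F) << 14)
                   | ((group[1] & 0x7F) << 7)
                   | (group[2] & 0x7F))
        i += 3
    return out
```

Because lead and trail bytes look alike, a decoder that starts one
byte into a run of 3-byte characters silently produces different
characters instead of noticing the misalignment. That is the lost
self-synchronization, and the reason a byte-wise string search would
need an extra alignment check.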
Since we need 22 bits, that scheme would force every non-ASCII
character into 4 bytes. Isn't that too much?
> c) With 21-bit words, you support the range 0x00_00_00 to
> 0x1F_FF_FF. But as Unicode and ISO promised that they will
> never use any code points above U-10FFFF, you have even in
> a 21-bit word the 0xF_00_00 = 983040 code positions
> 0x11_00_00 to 0x1F_FF_FF available for private use by emacs.
> Aren't almost a million private use positions more than good
> enough for what Emacs could need privately?
CCCII will require a code space of 884736 (= 96*96*96)
positions, even though it is very sparse.
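The two figures can be checked with a few lines of Python. The
variable names are made up for this example, and it only compares raw
counts; it says nothing about how the positions would actually be
laid out.

```python
# Positions above U-10FFFF that still fit in a 21-bit word.
private_positions = 0x1FFFFF - 0x110000 + 1

# CCCII code space as a 96 x 96 x 96 cube.
cccii_positions = 96 ** 3

print(private_positions, cccii_positions,
      private_positions - cccii_positions)
```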
Question 2:
> Many encodings (such as UTF-8 and others) have many possible
> malformed sequences that a normal decoder would reject. What will
> the UTF-8 -> Emacs converter do if it runs into one of these?
> Suggestion: It would seem good to have in the 21/22-bit Emacs space 256
> special characters allocated for representing bytes that came from
> malformed sequences. They would be displayed to the user in some \hex
> notation, they can be edited like any normal characters and there are even
> keyboard functions for inserting new malformed UTF-8 bytes. The Emacs ->
> UTF-8 encoder will insert these bytes into the produced bytestream such
> that a UTF-8 -> Emacs -> UTF-8 roundtrip becomes a completely 100%
> binary-transparent operation.
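A rough Python sketch of the suggested round trip, to show that it
can be made lossless. RAW_BASE, decode_lossless and encode_lossless
are hypothetical names, and placing the 256 special characters at
0x1FFF00 is an arbitrary choice for this example. Python's strict
UTF-8 decoder stands in for the well-formedness check.

```python
RAW_BASE = 0x1FFF00  # hypothetical home of the 256 "raw byte" characters


def decode_lossless(data):
    """Bytes -> list of code points; malformed bytes become RAW_BASE + byte."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1
        elif 0xC0 <= lead < 0xE0:
            n = 2
        elif 0xE0 <= lead < 0xF0:
            n = 3
        elif 0xF0 <= lead < 0xF8:
            n = 4
        else:
            n = 0  # stray continuation byte or invalid lead byte
        try:
            if n == 0:
                raise ValueError("not a lead byte")
            # Strict decoding rejects overlong, truncated and out-of-range forms.
            out.append(ord(data[i:i + n].decode("utf-8")))
            i += n
        except ValueError:
            out.append(RAW_BASE + lead)  # keep the offending byte as-is
            i += 1
    return out


def encode_lossless(cps):
    """List of code points -> bytes; raw-byte characters are emitted verbatim."""
    out = bytearray()
    for cp in cps:
        if RAW_BASE <= cp < RAW_BASE + 256:
            out.append(cp - RAW_BASE)
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)
```

Since every malformed byte is kept one-to-one and every well-formed
character has exactly one encoding, encode_lossless(decode_lossless(x))
returns x for any byte string x.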
I mostly agree. For now, I think we can represent such an
invalid byte with a little trick, encoding raw 0x80..0xFF
as this sequence:
  1100000x 10xxxxxx
(following-char) will return 0x80..0xFF at such a position,
so these bytes can't be distinguished from normal Unicode
characters, but that won't be a big problem.
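As a sketch of the trick in Python: the helper names below are made
up, and mapping the low 7 bits of the raw byte onto the 7 payload
bits is an assumption, since the pattern above does not spell that
out. The lead bytes 0xC0 and 0xC1 never occur in well-formed UTF-8,
which is what leaves this form free.

```python
def raw_byte_to_pair(b):
    """Map a raw byte 0x80..0xFF to the form 1100000x 10xxxxxx."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only 0x80..0xFF are represented this way")
    return bytes([0xC0 | ((b >> 6) & 1),   # 1100000x: bit 6 of the byte
                  0x80 | (b & 0x3F)])      # 10xxxxxx: bits 5..0


def pair_to_raw_byte(pair):
    """Inverse mapping; returns the value in 0x80..0xFF."""
    lead, trail = pair
    if (lead & 0xFE) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a 1100000x 10xxxxxx pair")
    return 0x80 | ((lead & 1) << 6) | (trail & 0x3F)
```

With this mapping 0x80 becomes C0 80 and 0xFF becomes C1 BF.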