Re: wcwidth update
Hi,
Egmont wrote:
> Now that recently every standard seemed to agree that UTF-8 uses at most 4
> (and not 6) bytes and the highest valid Unicode value is U+1FFFFF, I wonder
U+10FFFF, actually.
> whether the stress test should be updated, too. As far as I understand, the
> preferred new behavior for a former 5 or 6 byte long UTF-8 sequence is to
> emit 5 or 6 replacement characters, since the first byte is invalid, and
> subsequent bytes are unexpected continuation bytes.
I have not heard of any such proposal before (changing how many
replacement characters are emitted), and introducing it would be really
confusing. UTF-8 is a simple and straightforward encoding scheme that
happens to cover the full historic 31-bit ISO 10646 code space. That many
of those code points are now invalid does not necessarily mean that the
interpretation of UTF-8 has to change. I don't think it's worth introducing
this additional headache, especially as it would create new inconsistencies
between older and newer versions of terminals, of which we already have plenty.
Why can't a long UTF-8 sequence that happens to map to a code point outside
Unicode just be displayed as one replacement character? There is no
good reason for the change; please don't push it forward.
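
To make the two behaviours concrete, here is a minimal Python sketch. It is
an illustration only, not code from any terminal or from the stress test:
the names seq_len, decode and per_byte are hypothetical, and the decoder
is simplified (it leans on Python's own UTF-8 decoder for sequences of up to
4 bytes). With per_byte=False a well-formed legacy 5 or 6 byte sequence
yields one U+FFFD; with per_byte=True it yields one U+FFFD per byte.

```python
REPLACEMENT = "\ufffd"


def seq_len(lead):
    """Sequence length implied by a lead byte under the original
    31-bit UTF-8 scheme, or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # continuation byte, 0xFE or 0xFF


def decode(data, per_byte=False):
    """Decode bytes, replacing anything that is not valid Unicode.

    per_byte=False: one U+FFFD per structurally well-formed sequence.
    per_byte=True:  one U+FFFD per byte of such a sequence.
    """
    out = []
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        chunk = data[i:i + n]
        well_formed = (
            n > 0
            and len(chunk) == n
            and all(0x80 <= b <= 0xBF for b in chunk[1:])
        )
        if not well_formed:
            # Stray or truncated byte: always exactly one replacement.
            out.append(REPLACEMENT)
            i += 1
            continue
        try:
            if n > 4:
                # 5 and 6 byte forms never map to a Unicode code point.
                raise ValueError("legacy long sequence")
            out.append(chunk.decode("utf-8"))
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(REPLACEMENT * (n if per_byte else 1))
        i += n
    return "".join(out)
```

For example, decode(b"a\xf8\x88\x80\x80\x80b") gives "a", one U+FFFD, "b",
while the same call with per_byte=True gives five U+FFFD between "a" and "b".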
Kind regards,
Thomas
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/