
Re: utf-8 encoding scheme



Followup to:  <Pine.LNX.4.10.10006261642210.29611-100000@xxxxxxxxxxxxxxxx>
By author:    Jeu George <jeu@xxxxxxxxxxxxxxxx>
In newsgroup: linux.utf8
>
> 
> Hello,
> 
> 	The utf-8 encoding scheme goes like this
>   for
>   1-byte characters 0xxxxxxx 
>   2-byte characters 110xxxxx 10xxxxxx
>   3-byte characters 1110xxxx 10xxxxxx 10xxxxxx
> 

4-byte characters	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5-byte characters	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6-byte characters	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

> here the bits marked x are used up for the actual encoding of characters
> i would like to know the way these bits are used to code a particular
> character, also is this dependent on the operating system, can you provide a
> program which checks this or any link that provides information
> about this

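The bits are encoded big-endian (MSB first), i.e. the way you would
read the bits when written in the above form: the x positions, read
left to right across all bytes of the sequence, hold the character's
code value as a binary number, padded with leading zeros.  This does
not depend on the operating system; UTF-8 is defined in terms of
bytes, so the same character always yields the same byte sequence.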
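
As a rough illustration of the MSB-first packing, here is a minimal
Python sketch.  The function name utf8_encode and the layout table are
my own, not taken from any library.  It follows the 1- to 6-byte
layouts quoted in this thread; note that RFC 3629 later restricted
UTF-8 to at most 4 bytes (U+10FFFF), and that this sketch makes no
attempt to reject surrogates or other code points a strict encoder
would refuse.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the bit layouts from the table above.

    Illustrative sketch only: always picks the shortest layout that
    fits, but does not reject surrogates or values above U+10FFFF.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])              # 0xxxxxxx

    # (first value that no longer fits, lead-byte prefix, continuation bytes)
    layouts = [
        (0x800,      0xC0, 1),          # 110xxxxx + 1 continuation byte
        (0x10000,    0xE0, 2),          # 1110xxxx + 2
        (0x200000,   0xF0, 3),          # 11110xxx + 3
        (0x4000000,  0xF8, 4),          # 111110xx + 4
        (0x80000000, 0xFC, 5),          # 1111110x + 5
    ]
    for limit, prefix, ncont in layouts:
        if cp < limit:
            # The most significant bits go into the lead byte ...
            lead = prefix | (cp >> (6 * ncont))
            # ... and each continuation byte takes the next 6 bits, MSB first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(ncont - 1, -1, -1)]
            return bytes([lead] + tail)
    raise ValueError("code point does not fit in 31 bits")


def as_bits(encoded: bytes) -> str:
    """Render a byte string as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in encoded)
```

For example, as_bits(utf8_encode(0x4B)) gives "01001011", and
as_bits(utf8_encode(0x20AC)) gives "11100010 10000010 10101100".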

It is also very important to realize that ONLY THE SHORTEST POSSIBLE
SEQUENCE IS LEGAL.  This is incredibly important, since any misguided
attempt to "be liberal in what you accept" without addition of an
explicit canonicalization step would lead to the kind of security
holes that Microsoft web-related applications have been so full of,
because MS operating systems have way too many ways to say the same
thing.

Thus, the character K <U+004B> is encoded as:

	01001011

The alternate spelling

	11000001 10001011

... is not the character K <U+004B> but an INVALID SEQUENCE.  One
possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
CHARACTER on encountering illegal sequences.
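
To make the shortest-form rule concrete, here is a minimal decoder
sketch in Python.  All names (utf8_decode, sequence_length,
MIN_FOR_LENGTH) are hypothetical and chosen for this example.  It
assumes the 1- to 6-byte layouts quoted in this thread, and it makes
one particular choice about error recovery: a bad or overlong
sequence yields a single U+FFFD.  Other decoders may emit one U+FFFD
per offending byte instead.  It does not check for surrogates or for
values above U+10FFFF.

```python
REPLACEMENT = 0xFFFD

# Smallest code point that legitimately needs a sequence of each length.
# Anything smaller decoded from that length is an overlong "alternate spelling".
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def sequence_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1        # 0xxxxxxx
    if lead < 0xC0:
        return 0        # 10xxxxxx: stray continuation byte
    if lead < 0xE0:
        return 2        # 110xxxxx
    if lead < 0xF0:
        return 3        # 1110xxxx
    if lead < 0xF8:
        return 4        # 11110xxx
    if lead < 0xFC:
        return 5        # 111110xx
    if lead < 0xFE:
        return 6        # 1111110x
    return 0            # 0xFE and 0xFF never appear


def utf8_decode(data: bytes) -> list:
    """Decode bytes to a list of code points, U+FFFD for illegal sequences."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        if n == 1:
            out.append(lead)
            i += 1
            continue
        cp = lead & (0x7F >> n)         # payload bits of the lead byte
        j = i + 1
        # Consume up to n-1 continuation bytes, 6 payload bits each, MSB first.
        while j < i + n and j < len(data) and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j - i < n or cp < MIN_FOR_LENGTH[n]:
            out.append(REPLACEMENT)     # truncated, or not the shortest form
        else:
            out.append(cp)
        i = j
    return out
```

With this sketch, utf8_decode(bytes([0b01001011])) returns [0x4B],
while utf8_decode(bytes([0b11000001, 0b10001011])) returns [0xFFFD].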

	-hpa

-- 
<hpa@xxxxxxxxxxxxx> at work, <hpa@xxxxxxxxx> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/