Re: Character set tagging ... / UTF-8 detection
: Are you saying that it's not possible to detect UTF-8 encoding reliably?
: Well, that's something that needs to be worked on!
: ... Being able
: to reliably detect a UTF-8 encoded file will certainly help. And it will
: certainly not be a disadvantage when you are in a single-encoding environment.
UTF-8 text encoding auto-detection
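Count the following pairs of consecutive bytes, classified by their
leading bits, as shown in the table. Here 11.. is the first byte of a
multi-byte sequence, 10.. is a continuation byte, and 00.. stands for
any ASCII byte (top bit clear):

previous  current  verdict
11..      10..     good
00..      10..     bad
10..      10..     don't care
11..      00..     bad
11..      11..     bad
00..      00..     don't care
10..      00..     don't care
00..      11..     don't care
10..      11..     don't care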
The algorithm does not check that UTF-8 sequences have the correct
length, but I think it is good enough.
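As a rough illustration of the table, and of what skipping the length check means in practice, here is a small Python sketch. The function name `classify_pair` and the string labels are invented for this example, and reading "00.." as "any byte with the top bit clear" follows the `& 0x80` test in the pseudocode rather than anything stated explicitly.

```python
def classify_pair(previous_byte: int, current_byte: int) -> str:
    """Classify one pair of consecutive bytes according to the table."""
    prev_is_lead = (previous_byte & 0xC0) == 0xC0   # 11..
    prev_is_ascii = (previous_byte & 0x80) == 0x00  # top bit clear
    cur_is_cont = (current_byte & 0xC0) == 0x80     # 10..

    if cur_is_cont:
        if prev_is_lead:
            return "good"        # 11.. 10..
        if prev_is_ascii:
            return "bad"         # 00.. 10..
        return "don't care"      # 10.. 10..
    if prev_is_lead:
        return "bad"             # 11.. 00.. and 11.. 11..
    return "don't care"          # all remaining combinations


# U+20AC (EURO SIGN) is E2 82 AC in UTF-8. Here the last byte is cut
# off and an ASCII letter follows: E2 82 41.
truncated = bytes([0xE2, 0x82, 0x41])
verdicts = [classify_pair(a, b) for a, b in zip(truncated, truncated[1:])]
print(verdicts)  # ['good', "don't care"]

# A strict decoder rejects the same bytes; that is the length check
# the pair-counting heuristic leaves out.
try:
    truncated.decode("utf-8")
    strictly_valid = True
except UnicodeDecodeError:
    strictly_valid = False
print(strictly_valid)  # False
```

So a truncated sequence still scores one "good" pair and no "bad" one, which is the kind of case the heuristic accepts in exchange for its simplicity.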
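previous_byte = 0;    /* assumed start value: treat start of file like ASCII */
for all bytes in file do {
    if ((current_byte & 0xC0) == 0x80) {
        /* current byte is a continuation byte */
        if ((previous_byte & 0xC0) == 0xC0) {
            count_good_utf ++;    /* lead byte + continuation byte */
        } else if ((previous_byte & 0x80) == 0x00) {
            count_bad_utf ++;     /* ASCII byte + continuation byte */
        }
    } else if ((previous_byte & 0xC0) == 0xC0) {
        count_bad_utf ++;         /* lead byte without continuation byte */
    }
    previous_byte = current_byte;
}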
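For anyone who wants to try the counting step directly, the loop above might be written in Python roughly as follows. This is only a sketch: the function name `count_utf8_pairs` is made up, and starting with `previous_byte = 0` (so that the start of the data behaves as if preceded by ASCII) is an assumption, since the pseudocode does not say how the first byte is handled.

```python
from typing import Tuple


def count_utf8_pairs(data: bytes) -> Tuple[int, int]:
    """Return (count_good_utf, count_bad_utf) for a byte string."""
    count_good_utf = 0
    count_bad_utf = 0
    previous_byte = 0x00  # assumption: start of data treated like ASCII
    for current_byte in data:
        if (current_byte & 0xC0) == 0x80:
            # Current byte is a continuation byte (10..).
            if (previous_byte & 0xC0) == 0xC0:
                count_good_utf += 1   # lead byte + continuation byte
            elif (previous_byte & 0x80) == 0x00:
                count_bad_utf += 1    # ASCII byte + continuation byte
        elif (previous_byte & 0xC0) == 0xC0:
            count_bad_utf += 1        # lead byte without continuation byte
        previous_byte = current_byte
    return count_good_utf, count_bad_utf


# "Gruesse" with u-umlaut and sharp s, written with escapes so the
# example does not depend on the encoding of this message.
word = "Gr\u00fc\u00dfe"
print(count_utf8_pairs(word.encode("utf-8")))    # (2, 0)
print(count_utf8_pairs(word.encode("latin-1")))  # (0, 2)
```

The same word scores two good pairs when encoded as UTF-8 and two bad pairs when encoded as Latin-1, because the Latin-1 bytes FC and DF both look like lead bytes that are never followed by a continuation byte.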
Finally:

if (auto_UTF_detect == True) {
    if (UTF_BOM_detected == True
        || count_good_utf >= count_bad_utf)
        utf8_text = True;
    else
        utf8_text = False;
}
The comparison ">=" treats pure ASCII files (no good and no bad pairs)
as UTF-8; replace it with ">" to change that.
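To make the effect of ">=" versus ">" concrete, here is a hedged Python sketch of the whole decision. The names `looks_like_utf8` and `ascii_counts_as_utf8` are invented for this example, the `auto_UTF_detect` switch is left out, and "BOM detected" is assumed to mean that the data starts with the bytes EF BB BF.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # assumption: BOM detection = data starts with these bytes


def looks_like_utf8(data: bytes, ascii_counts_as_utf8: bool = True) -> bool:
    """Pair-counting guess; ascii_counts_as_utf8 selects '>=' or '>'."""
    good = bad = 0
    previous_byte = 0x00  # assumption: start of data treated like ASCII
    for current_byte in data:
        if (current_byte & 0xC0) == 0x80:
            if (previous_byte & 0xC0) == 0xC0:
                good += 1
            elif (previous_byte & 0x80) == 0x00:
                bad += 1
        elif (previous_byte & 0xC0) == 0xC0:
            bad += 1
        previous_byte = current_byte

    if data.startswith(UTF8_BOM):
        return True
    if ascii_counts_as_utf8:
        return good >= bad   # ties, including 0 == 0, count as UTF-8
    return good > bad        # at least one more good pair than bad pairs


print(looks_like_utf8(b"plain ASCII"))                              # True
print(looks_like_utf8(b"plain ASCII", ascii_counts_as_utf8=False))  # False
print(looks_like_utf8("\u00e9t\u00e9".encode("latin-1")))           # False
```

Note that switching to ">" changes the answer for every tie, not only for pure ASCII: any data with equally many good and bad pairs is then no longer reported as UTF-8.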
Kind regards,
Thomas Wolff
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/