
Re: Character set tagging ... / UTF-8 detection



: Are you saying that it's not possible to detect UTF-8 encoding reliably?
: Well, that's something that needs to be worked on!
: ... Being able
: to reliably detect a UTF8 encoded file will certainly help.  And it will
: certainly not be a disadvantage when you are in a single-encoding environment.

UTF-8 text encoding auto-detection

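Count the following pairs of consecutive bytes (previous byte, current
byte, each classified by its top bits) as shown in the table:
previous	current	count as
11..	10..	good
00..	10..	bad
10..	10..	don't care
11..	00..	bad
11..	11..	bad
00..	00..	don't care
10..	00..	don't care
00..	11..	don't care
10..	11..	don't care
Here "00.." stands for any ASCII byte (top bit clear, so 01.. is included),
which is what the code below tests.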
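
As a rough sketch, the table could be written as a single C function. The
names pair_class and classify_pair are made up for illustration and are not
part of the proposal; "00.." is read as "top bit clear".

```c
/* Hypothetical names for illustration only. */
enum pair_class { PAIR_DONT_CARE, PAIR_GOOD, PAIR_BAD };

/* Classify two consecutive bytes by their top bits, following the table. */
static enum pair_class classify_pair(unsigned char prev, unsigned char cur)
{
    int prev_is_lead  = (prev & 0xC0) == 0xC0;  /* 11.. */
    int prev_is_ascii = (prev & 0x80) == 0x00;  /* top bit clear */
    int cur_is_cont   = (cur  & 0xC0) == 0x80;  /* 10.. */

    if (cur_is_cont) {
        if (prev_is_lead)
            return PAIR_GOOD;       /* 11.. 10.. */
        if (prev_is_ascii)
            return PAIR_BAD;        /* 00.. 10.. */
        return PAIR_DONT_CARE;      /* 10.. 10.. */
    }
    if (prev_is_lead)
        return PAIR_BAD;            /* 11.. 00.. and 11.. 11.. */
    return PAIR_DONT_CARE;          /* all remaining rows */
}
```

For example, classify_pair(0xC3, 0xA9) (the UTF-8 bytes of "é") gives
PAIR_GOOD, while classify_pair(0xE9, '!') (Latin-1 "é" followed by ASCII)
gives PAIR_BAD.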

The algorithm does not check that each UTF-8 sequence has the correct
length, but I think it's good enough.
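
To illustrate what is being skipped, here is a small sketch of the length
information a lead byte carries. Both helper names are hypothetical, and the
sketch assumes UTF-8 as limited to 4 bytes per character (RFC 3629); it is
not part of the proposed algorithm.

```c
#include <stddef.h>

/* Number of continuation bytes a lead byte announces, or -1 if the byte
   is not a lead byte of a 2-, 3- or 4-byte sequence.
   Hypothetical helper; assumes at most 4 bytes per character. */
static int expected_continuations(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 1;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 2;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 3;  /* 11110xxx */
    return -1;
}

/* Count the continuation bytes (10xxxxxx) that directly follow buf[pos]. */
static int actual_continuations(const unsigned char *buf, size_t len,
                                size_t pos)
{
    int n = 0;
    for (size_t i = pos + 1; i < len && (buf[i] & 0xC0) == 0x80; i++)
        n++;
    return n;
}
```

For the bytes E2 82 41, the lead byte E2 announces two continuation bytes
but only one follows. The pair count still scores E2 82 as good and ignores
82 41, so the truncated sequence goes unnoticed.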

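	previous_byte = 0;	/* nothing precedes the first byte */
	for all bytes in file do {
	    if ((current_byte & 0xC0) == 0x80) {		/* 10.. */
		if ((previous_byte & 0xC0) == 0xC0) {		/* after 11.. */
			count_good_utf ++;
		} else if ((previous_byte & 0x80) == 0x00) {	/* after ASCII */
			count_bad_utf ++;
		}
	    } else if ((previous_byte & 0xC0) == 0xC0) {	/* 11.. not followed by 10.. */
		count_bad_utf ++;
	    }
	    previous_byte = current_byte;
	}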

	finally,
	    if (auto_UTF_detect == True) {
		if (UTF_BOM_detected == True
		    || count_good_utf >= count_bad_utf)
			utf8_text = True;
		else
			utf8_text = False;
	    }

	The comparison ">=" treats pure ASCII files (both counts zero) as UTF-8;
	replace it with ">" to change that.
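
Put together, the counting loop and the final decision might look like this
in C. This is only a sketch: the function name detect_utf8, its parameters,
and the choice to treat the start of the buffer as if an ASCII byte preceded
it are my assumptions, and the BOM test simply looks for the bytes EF BB BF
at the start.

```c
#include <stddef.h>

/* Sketch of the pair-counting heuristic; names are illustrative.
   ascii_is_utf8 != 0 selects the ">=" comparison, 0 selects ">". */
static int detect_utf8(const unsigned char *buf, size_t len,
                       int ascii_is_utf8)
{
    unsigned long good = 0, bad = 0;
    unsigned char prev = 0;  /* assumption: start acts like an ASCII byte */

    /* A leading UTF-8 BOM (EF BB BF) settles the question at once. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 1;

    for (size_t i = 0; i < len; i++) {
        unsigned char cur = buf[i];

        if ((cur & 0xC0) == 0x80) {             /* continuation byte */
            if ((prev & 0xC0) == 0xC0)
                good++;                         /* follows a lead byte */
            else if ((prev & 0x80) == 0x00)
                bad++;                          /* follows an ASCII byte */
        } else if ((prev & 0xC0) == 0xC0) {
            bad++;                              /* lead byte left alone */
        }
        prev = cur;
    }

    return ascii_is_utf8 ? (good >= bad) : (good > bad);
}
```

A caller would read the file into a buffer and pass it in, for instance
detect_utf8(buf, len, 1). A pure ASCII buffer yields 1 with ascii_is_utf8
set and 0 without it, which matches the ">=" versus ">" remark above.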

Kind regards,
Thomas Wolff
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/