
Re: Character set tagging considered harmful



Markus Kuhn wrote:
: Bram Moolenaar wrote on 1999-09-18 12:23 UTC:
: > Are you saying that it's not possible to detect UTF-8 encoding reliably?
: > Well, that's something that needs to be worked on!
: 
: LC_CTYPE is the best detector you will ever get. It allows us so far to
: distinguish ISO_8859-15 from JISX0208, and I see no reason why it should
: suddenly fail on UTF-8. Everything else is just a heuristic. The
: self-synchronizing properties of UTF-8 make it more feasible to write a >95%
: heuristic for UTF-8 than for other encodings, but you should be
: careful to apply such autodetection ONLY when the user didn't tell you
: explicitly via LC_CTYPE what the intended encoding is. The user must be
: able to reliably enforce interpretation of the file as UTF-8 for
: mission-critical applications, where the remaining risk of autodetection
: or tagging is not acceptable.

I think there is some confusion here. Auto-detection applies to text,
i.e. file contents, whereas I would assume LC_CTYPE describes the
environment we are running in, especially the terminal mode.
The two need not be the same, and if LC_CTYPE is used to define one of
them, it should probably not be used to derive the other, which is
usually quite unrelated.
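
To make the two notions in this thread concrete, here is a rough Python
sketch, not anything taken from an existing tool. It keeps the locale
lookup (what the environment claims) separate from the content check
(what the file bytes look like), and lets the caller decide which one
wins. The function names, the order LC_ALL, LC_CTYPE, LANG, and the
ISO-8859-1 fallback are all assumptions made for illustration.

```python
from typing import Mapping, Optional


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic content check: strict UTF-8 with at least one non-ASCII byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid in many encodings, so it proves nothing.
    return any(b >= 0x80 for b in data)


def charset_from_locale(env: Mapping[str, str]) -> Optional[str]:
    """Codeset named by the first locale variable that is set, or None.

    Assumes the usual language_TERRITORY.codeset@modifier form.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            if "." in value:
                return value.split(".", 1)[1].split("@", 1)[0]
            return None  # e.g. "C" or "POSIX": no codeset stated
    return None


def guess_file_encoding(data: bytes, env: Mapping[str, str],
                        fallback: str = "iso-8859-1") -> str:
    """Prefer an explicit locale codeset; autodetect only without one."""
    explicit = charset_from_locale(env)
    if explicit:
        return explicit
    return "utf-8" if looks_like_utf8(data) else fallback
```

Calling guess_file_encoding(data, os.environ) gives the "LC_CTYPE first"
behaviour from the quoted text, while calling looks_like_utf8(data) alone
judges the file contents independently of the terminal environment.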

Kind regards,
Thomas Wolff
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/