Re: Character set tagging considered harmful
Markus Kuhn wrote:
: Bram Moolenaar wrote on 1999-09-18 12:23 UTC:
: > Are you saying that it's not possible to detect UTF-8 encoding reliably?
: > Well, that's something that needs to be worked on!
:
: LC_CTYPE is the best detector you will ever get. It allows us so far to
: distinguish ISO_8859-15 from JISX0208, and I see no reason why it should
: suddenly fail on UTF-8. Everything else is just a heuristic. The
: self-synchronizing properties of UTF-8 make it more feasible to write
: a >95% heuristic for UTF-8 than for other encodings, but you should be
: careful to apply such autodetection ONLY when the user didn't tell you
: explicitly via LC_CTYPE what the intended encoding is. The user must be
: able to reliably enforce interpretation of the file as UTF-8 for
: mission-critical applications, where the remaining risk of autodetection
: or tagging is not acceptable.
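
As a rough illustration of the policy quoted above (an explicit setting
always wins, and the heuristic runs only when nothing was declared), here
is a minimal Python sketch. The function names, the `declared` parameter,
and the ISO 8859-15 fallback are assumptions made for the example; nothing
in this thread prescribes them.

```python
from typing import Optional


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: data is valid strict UTF-8 and contains non-ASCII bytes."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 but carries no evidence either way.
    return any(b >= 0x80 for b in data)


def choose_encoding(data: bytes,
                    declared: Optional[str] = None,
                    fallback: str = "iso-8859-15") -> str:
    """Pick an encoding for data.

    declared is whatever the user stated explicitly (hypothetically
    derived from LC_CTYPE); when present it is never overridden.
    """
    if declared:
        return declared
    if looks_like_utf8(data):
        return "utf-8"
    return fallback
```

The check is still only a heuristic: a short ISO 8859-15 text can happen to
form valid UTF-8 sequences, which is why the explicit setting takes priority.
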
I think there is some confusion here. Auto-detection applies to text,
i.e. file contents, while I would assume LC_CTYPE describes the
environment we are running in, especially the terminal mode.
The two need not be the same, and if LC_CTYPE is used to define one
of them, it should probably not also be used to derive the other,
which is usually quite unrelated.
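
To make the distinction concrete, here is a small Python sketch that keeps
the two encodings as separate values: one taken from the locale for the
terminal, one supplied independently for the file. The function names are
made up for the example, and using the locale's preferred encoding as the
terminal encoding is an assumption, not something every setup guarantees.

```python
import locale


def terminal_encoding() -> str:
    """Encoding of the running environment, as reported by the locale."""
    return locale.getpreferredencoding(False)


def render_for_terminal(data: bytes,
                        file_encoding: str,
                        term_encoding: str) -> bytes:
    """Transcode file contents into the terminal's encoding.

    file_encoding describes the text itself and is passed in separately;
    it is deliberately not derived from the terminal's locale.
    """
    text = data.decode(file_encoding)
    # Characters the terminal cannot show are replaced with '?'.
    return text.encode(term_encoding, errors="replace")
```

For instance, a UTF-8 file viewed on an ISO 8859-15 terminal would be called
as `render_for_terminal(data, "utf-8", terminal_encoding())`.
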
Kind regards,
Thomas Wolff
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/