Re: Character set tagging considered harmful



towo@computer.org wrote on 1999-09-21 13:08 UTC:
> I think there is some confusion here. Auto-detection applies to text, 
> i.e. file contents, while I would assume LC_CTYPE to describe the 
> environment that we're running in, especially the terminal mode.
> This doesn't need to be the same and if LC_CTYPE is used to define one 
> thing it should perhaps rather not be used to derive the other information 
> which is usually quite unrelated.

I really think they are the same, they were intended to be the same, and
in my opinion they really should be the same. I like

  cat <file.txt

and

  cat >file.txt

to continue to work with our notion of plaintext in the future as well.
We should therefore always aim to keep the content of a plain-text file
something that can be sent directly, byte for byte, to the terminal. Much
of the current simplicity, elegance and power of the Unix plaintext
world fundamentally depends on this. It won't be Unix any more if we
start to introduce plaintext file types. (By the way, we already had this
exact discussion back in 1995 on comp.std.internat; it should still
be in dejanews.)

How far do you want to take autodetection? Do you want "ls" to
autodetect whether a filename is in Latin-2, ISO 8859-15, JIS X 0208 or
UTF-8 and convert it automatically? Character set
autodetection, if it really became commonplace under Unix, would mean
that practically every application would have to be equipped with a
full-fledged any-to-any conversion package. Horrible prospect. No, I
really, really think that separating the plain-text encoding from the
terminal encoding is a rather dangerous route, one that I most certainly
will not support in any way. None of this has anything to do with UTF-8
in particular, which is just yet another encoding and should be treated
as such. The entire autodetection or tagging business sounds to me very
much like reinventing ISO 2022 with all its consequences.
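
A minimal sketch of this model in Python, for illustration only: the
encoding is taken once from the locale environment, and file contents
are passed through unchanged with no detection or conversion step. The
function names locale_codeset and cat are hypothetical, and the parsing
assumes the common language_TERRITORY.codeset@modifier form of locale
names, which POSIX does not strictly require.

```python
import io
import os


def locale_codeset(env=None):
    """Return the codeset named by the locale variables, or None.

    Hypothetical helper: it only parses the variable text and does not
    consult the system locale database.
    """
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # An empty value counts as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            if "." not in value:
                return None  # e.g. "C" or "POSIX": no codeset named
            codeset = value.split(".", 1)[1]
            return codeset.split("@", 1)[0]  # drop any "@modifier"
    return None


def cat(src, dst):
    """Copy bytes from src to dst unchanged, with no decoding at all."""
    while True:
        chunk = src.read(65536)
        if not chunk:
            break
        dst.write(chunk)


if __name__ == "__main__":
    data = "Gr\u00fc\u00dfe\n".encode("utf-8")
    out = io.BytesIO()
    cat(io.BytesIO(data), out)
    assert out.getvalue() == data  # byte-for-byte identical
    print(locale_codeset())
```

Under this assumption only programs that must interpret characters
(sorting, case conversion, display width) ever need to ask for the
codeset, and they all get the same answer for file contents, filenames
and the terminal.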

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/