Re: Global LC_CTYPE and file names
Andries.Brouwer@cwi.nl wrote on 1999-09-16 16:18 UTC:
> From: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
>
> Andries.Brouwer@cwi.nl wrote on 1999-09-16 13:19 UTC:
> > So all users on the system have the same LC_CTYPE?
>
> As long as they want to share any non-ASCII plaintext (including from
> foreign file systems): YES.
>
> Hmm. A POSIX compliant system allows its users to set LC_CTYPE
> to any desired value, even for each process separately.

True, but this does not mean that users have to do so. POSIX only says
that users should be able to tell each application which character
encoding they use in their files, file names, etc. POSIX does not say or
recommend or endorse that users should use different character encodings
in their files on a system, or that the file system should support
different encoding views of files, or anything like that. On the
contrary! There are many places in POSIX.2 that suggest that using
different LC_CTYPE values could cause trouble. There are indeed more
recent POSIX documents which recommend that you should use UTF-8
everywhere, realizing that POSIX is not really capable of supporting
multiple encodings on the same system simultaneously in a realistic way,
because files, environment variables, file names, pipes, ttys, etc. do
not have locale tags. If we introduced locale tags (file types, grrr)
for all of these, it would not be Unix any more (more like Multics).

It is not common practice today for people to use multiple LC_CTYPE
values on the same system. People in Germany tend to use Latin-1
everywhere on their system, people in Russia tend to use KOI8-R
everywhere, etc. If people use multiple LC_CTYPE values today on a
single system, they are likely to get bad results occasionally, e.g.
unreadable file names. Those few who do this are used to getting bad
results and can live with it.
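
To make the "unreadable file names" point concrete, here is a minimal
Python sketch. The file name and the particular encodings are only an
illustration that I picked for this example; the point is merely that
the bytes stored in a directory entry carry no encoding tag, so each
process decodes them according to its own LC_CTYPE.

```python
# Hypothetical file name, chosen only for illustration.
name = "Grüße.txt"

# Bytes a process running in a Latin-1 locale would store as the name.
raw = name.encode("latin-1")

# A process running in a KOI8-R locale decodes the very same bytes
# into different characters, so the name looks like garbage to it.
as_koi8r = raw.decode("koi8_r")

# A process running in a UTF-8 locale cannot decode the bytes at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(raw, as_koi8r, utf8_ok)
```

Nothing is lost on disk in any of these cases: the bytes are identical,
only the interpretation differs between processes.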

The solution is to migrate to a single encoding that is suitable for
everyone, i.e. UTF-8. Then we can forget about the encoding aspects of
locales. The meaning of LANG etc. will really be reduced to language
preferences, etc., which are much less critical for interoperability
than the character encoding. Language preferences are much easier to
specify on a per-process basis in a system.

We want to encourage users to use only one encoding, because this is
simple, robust, and technically sound. POSIX does allow and encourage
this. The fact that POSIX allows different processes to theoretically
have different opinions about the external character encoding in no
way means that this is a good thing and should be supported with
enormous additional machinery. It just means that the POSIX mechanism
is more flexible than what will be necessary in the long term.

> It is impossible for the mounts done at boot time to react to
> the user's environment variables.

Therefore, the user's environment variables shouldn't change at runtime
in this respect. It is that simple and there is nothing wrong with it.

I think you haven't seen the light yet. Please read the Plan9 paper
ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz
The fathers of Unix already saw the light 8 years ago: Plan9 has
no LC_CTYPE at all. It is hardwired to UTF-8 and works much better than
any other Unix i18n attempt I have seen so far.

Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/