Re: filename encoding (was: ISO-2022)



Andries Brouwer wrote:

>     There is a potential big problem in this area.  If the kernel doesn't do
>     conversion, will all applications have to do this?
> 
> No. A filename is just a sequence of bytes - no conversion required
> or desirable.

From the point of view of the kernel it's just a sequence of bytes (except for
'/' and NUL).  From the point of view of the user the bytes form characters
with a specific meaning.  If you use the wrong character set, that meaning is
lost.  Conversion is required to keep the meaning.
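
To illustrate the loss of meaning, here is a minimal Python sketch.  The
sample name is made up; the only point is that one stored byte sequence reads
as different characters under a KOI8-R locale and an ISO 8859-1 locale.

```python
# The bytes as the kernel stores them: a Russian user creates a file named
# "файл" ("file") while working in a KOI8-R locale.
raw = "файл".encode("koi8_r")

# A user whose locale is KOI8-R sees the intended name.
as_koi8 = raw.decode("koi8_r")

# A user whose locale is ISO 8859-1 sees the very same bytes as other letters.
as_latin1 = raw.decode("latin_1")

print(raw)         # the four stored bytes
print(as_koi8)     # the name as its creator meant it
print(as_latin1)   # the name as the other user sees it
```

Nothing on disk differs between the two views; only the assumed character set
does.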

>     I suppose the filesystem should have a setting somewhere as to which
>     encoding is used for the file names.  Applications (or the kernel)
>     should then do conversion.  Obviously, the encoding used for the
>     file system should match with the most often used locale to avoid
>     too many conversions.
> 
> It doesn't work (at present).
> Linux is a multi-user system. Different users with different nationalities
> use different locales. These Russians all want KOI-8, while the Danes
> want ISO 8859-1. Most filesystem types do not store the character set
> the filename is supposed to be in, and most users do not know enough
> to supply such information.

Well, it's about time we start this then.  If I mount some disk (or CD-ROM or
diskette or tape) and don't know what encoding is used for the file names, I
can only use trial and error to find out.  That's bad.

If we are going to introduce UTF-8 for file names (which is mostly a good
idea), there will be a conflict with ISO-8859 names currently used (especially
in Europe).  If this problem isn't solved properly, users will not convert to
using UTF-8.  That's why this problem needs to be tackled and discussed in
this list.
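
One possible way to soften the conflict during a transition, sketched in
Python under my own assumptions: try strict UTF-8 first and fall back to a
legacy character set only when the bytes are not valid UTF-8.  The function
name decode_name and the default fallback are hypothetical, and this is a
heuristic, not a solution: some ISO 8859-1 byte sequences happen to be valid
UTF-8 and would be misread.

```python
def decode_name(raw: bytes, legacy: str = "latin_1") -> str:
    """Decode file name bytes: strict UTF-8 first, then a legacy fallback.

    `legacy` is an assumed per-user or per-filesystem default; the file
    system itself does not record it.
    """
    try:
        return raw.decode("utf-8")   # strict: raises on malformed input
    except UnicodeDecodeError:
        return raw.decode(legacy)    # ISO 8859-1 maps every byte value


old_name = "café".encode("latin_1")  # written before the switch to UTF-8
new_name = "café".encode("utf-8")    # written after the switch

print(decode_name(old_name))
print(decode_name(new_name))
```

Both names come out readable here, but only because the fallback happens to
match the character set the old name was written in.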

> That is why I agree with Bruno (on the first point) - everybody sets
> things right for his own locale, and sees his own filenames as intended. 
> In the long run we'll maybe all use UTF-8 and the problem disappears.

This conflicts with what you just said: different people currently use
different encodings.  We are going to add UTF-8 to that list.  Eventually
(hopefully) the others will die out, but that will take a long time.

UTF-8 may be the holy grail, but we have a long quest ahead to get to it.

Many people will resist switching to a new character set, even though it will
solve problems in the long run.  We need to make the transition as easy as
possible, otherwise many people will not do it.

I think the problem is clear: file names can be encoded in any character set.
We need to know the character set used to do anything with those names.  Thus
the character set must be associated with the file system, either implicitly
(an old tape is probably 7-bit ASCII, a FAT floppy probably uses an MS-DOS
codepage, etc.) or explicitly (that would be a new mechanism).

For a new filesystem in Linux, it could be implicit UTF-8.  That would make
things simpler, although it does require checks for illegal byte sequences to
keep the file system from becoming corrupted.
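
A rough sketch of such a check in Python, assuming the rule is simply "a name
component must be well-formed UTF-8 and must not contain '/' or NUL".  The
function name is hypothetical; a real filesystem would do this in the kernel,
in C.

```python
def is_valid_utf8_name(raw: bytes) -> bool:
    """Return True if raw is acceptable as a UTF-8 file name component."""
    if not raw or b"/" in raw or b"\0" in raw:
        return False                 # empty names, '/' and NUL are not allowed
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms and encoded surrogates.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8_name("naïve.txt".encode("utf-8")))  # well-formed UTF-8
print(is_valid_utf8_name(b"caf\xe9"))                   # ISO 8859-1 leftover
print(is_valid_utf8_name(b"\xc0\xaf"))                  # overlong form of '/'
```

The overlong case is the reason a plain search for the '/' byte is not
enough: the two bytes C0 AF contain no literal '/', yet a sloppy decoder
would turn them into one.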

If the encoding is known, conversion can be done when required.  Where this
happens is to be decided, although I wouldn't be surprised if someone has
already solved this somewhere.  Isn't it done for CD-ROM filesystems already?
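
Once the source encoding is known, the conversion itself is small.  A sketch
in Python, wherever it ends up living (kernel, libc or application); the
function name convert_name and the choice of codepage 437 for the floppy are
my assumptions for the example.

```python
def convert_name(raw: bytes, src: str, dst: str = "utf-8") -> bytes:
    """Re-encode file name bytes from `src` to `dst`.

    Raises UnicodeDecodeError if raw is not valid in `src`, and
    UnicodeEncodeError if a character has no representation in `dst`.
    """
    return raw.decode(src).encode(dst)


# A name from a FAT floppy, assumed here to be in MS-DOS codepage 437.
floppy_name = b"\x81ber.txt"                  # "über.txt" in cp437
utf8_name = convert_name(floppy_name, "cp437")
print(utf8_name.decode("utf-8"))

# Converting the other way can fail: Cyrillic has no place in ISO 8859-1.
try:
    convert_name("файл".encode("utf-8"), "utf-8", "latin_1")
except UnicodeEncodeError:
    print("name cannot be represented in ISO 8859-1")
```

The failing case shows why the target of the conversion matters: going to
UTF-8 never loses characters, going to an 8-bit set can.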

-- 
hundred-and-one symptoms of being an internet addict:
119. You are reading a book and look for the scroll bar to get to
     the next page.

 ///  Bram Moolenaar -- Bram@xxxxxxxxxxxxx -- http://www.moolenaar.net  \\\
(((   Creator of Vim - http://www.vim.org -- ftp://ftp.vim.org/pub/vim   )))
 \\\  Help me helping AIDS orphans in Uganda - http://iccf-holland.org  ///
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/