Re: filetype field?
PILCH Hartmut wrote:
> On Wed, 3 Nov 1999, Bram Moolenaar wrote:
>
> > > - The Unix kernel #!/bin/sh mechanism will break, because the
> > > file will not start any more with #!
> >
> > Good point. Putting the BOM in the second line would work. But that's a
> > bit strange. It would be better to adjust the kernel to handle UTF-8
> > files, and thus ignore the BOM in this position. Just one more place that
> > needs to be UTF-8 aware, not a big deal.
>
> If the kernel is to look for a UTF-8 BOM, it might as well look for a
> general encoding marker. That seems to be what you are using the BOM for.
> There is no byte order to be marked in UTF-8 texts, is there?
No, the suggestion is to use the BOM to mark a file as being a UTF-8 file.
There is no byte order problem in UTF-8, since it is always an ordered stream
of bytes. The BOM is ignored otherwise. The advantage of using the BOM to
mark a file as UTF-8 is that it's already in the Unicode standard, thus no
new "general encoding marker" byte sequence needs to be introduced.
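For illustration only, here is a minimal Python sketch of what "using the BOM
as a marker" amounts to: checking whether the first three bytes are EF BB BF,
the UTF-8 encoding of U+FEFF. The function name has_utf8_bom is hypothetical
and not taken from Vim or any other program.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 encoded BOM."""
    # codecs.BOM_UTF8 is b'\xef\xbb\xbf', i.e. U+FEFF encoded as UTF-8.
    return data.startswith(codecs.BOM_UTF8)


# A script saved with the marker no longer begins with "#!".
marked_script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
plain_script = b"#!/bin/sh\necho hello\n"
```

Note that marked_script begins with those three bytes rather than with "#!",
which is the kernel problem raised above.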
> If the kernel is to be changed, why not go to the roots and introduce a
> filetype field into the inode table, similar to the permissions field,
> with commands like
>
> $ chft "text/plain; charset=utf-8" file1.txt
> $ chft "text/plain; charset=iso-8859-1" file2.txt
> $ chft "image/png" file.png
>
> and a
>
> /etc/filetypes
>
> table that associates mime types to code numbers in a
> tending-to-become-standardized way?
Well, that might work for Linux systems specifically. Feel free to suggest it
to the people that work on it. What I am looking for myself is a solution
that works on all platforms that Vim runs on, including Windows, Solaris, Mac,
OS/2, etc. All these systems could have UTF-8 files that you would want to
edit with Vim.
> That could at least ensure that no BOMs are misplaced during
>
> $ cat file1.txt file2.txt > file.txt
For Linux, yes. But the problems that have already been mentioned with
networked file systems remain.
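To make the misplaced-BOM problem concrete: a plain byte-wise concatenation,
which is what cat does, leaves the second file's BOM in the middle of the
result. The sketch below is my own illustration in Python, with the
hypothetical name concat_utf8, and it assumes every input really is UTF-8. It
strips the marker from each piece and writes at most one at the front.

```python
import codecs


def concat_utf8(chunks, keep_leading_bom=True):
    """Join UTF-8 byte strings so that no BOM ends up inside the result."""
    out = bytearray()
    if keep_leading_bom:
        out += codecs.BOM_UTF8
    for chunk in chunks:
        # Drop a leading BOM from each piece before appending it.
        if chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out += chunk
    return bytes(out)


file1 = codecs.BOM_UTF8 + b"one\n"
file2 = codecs.BOM_UTF8 + b"two\n"
naive = file1 + file2          # what a byte-wise cat produces
joined = concat_utf8([file1, file2])
```

Here naive contains two BOMs, the second one in the middle of the data, while
joined contains only the leading one. This does nothing for inputs in
different encodings, such as the iso-8859-1 file in the example above.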
Again, introducing a BOM for UTF-8 files isn't without problems. But leaving
out the BOM also has its problems, since it's then hard to know whether a file
is UTF-8 encoded. Autodetection has to be used, which slows down loading a
file and isn't 100% reliable.
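A rough sketch of such autodetection, in Python and purely as an illustration
(looks_like_utf8 is a hypothetical name, and this is not how Vim does it):
try to decode the whole file strictly as UTF-8 and see whether that fails.
Both drawbacks show up directly. The entire file has to be scanned, and a
successful decode does not prove the file was meant to be UTF-8.

```python
def looks_like_utf8(data):
    """Guess whether a byte string is UTF-8 by attempting a strict decode."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "é".encode("utf-8")      # b'\xc3\xa9'
latin1_bytes = "é".encode("latin-1")  # b'\xe9', not valid UTF-8 on its own
ambiguous = "Ã©".encode("latin-1")    # also b'\xc3\xa9'
```

The check accepts utf8_bytes and rejects latin1_bytes, but it also accepts
ambiguous, which is Latin-1 text whose bytes happen to form valid UTF-8. Pure
ASCII files pass as well, so the guess can only ever be a heuristic.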
One issue I just thought of: Isn't it true that a UTF-8 file can always
legally start with a BOM? I mean, the standard does allow this, doesn't it?
Then all UTF-8 aware applications should be able to handle it correctly. This
mostly means they ignore the BOM, and handle it like a non-printing,
zero-width character. Perhaps this should be added to tests for UTF-8
compatibility. Of course, this doesn't change anything for applications that
are _not_ UTF-8 aware and the problems they might have with a BOM.
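As one example of an application-side way to "ignore the BOM", Python's
standard "utf-8-sig" codec drops a single leading BOM when decoding and
otherwise behaves like "utf-8". The helper name below is hypothetical. It is
only a sketch of the kind of check a compatibility test could make.

```python
import codecs


def decode_ignoring_bom(data):
    """Decode UTF-8 bytes, silently dropping one leading BOM if present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + b"hello\n"
without_bom = b"hello\n"

# Both inputs should decode to the same text.
same_text = decode_ignoring_bom(with_bom) == decode_ignoring_bom(without_bom)

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
kept = with_bom.decode("utf-8")
```

Only a BOM at the very start is dropped. One that ends up in the middle of the
data stays in the text as U+FEFF.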
--
hundred-and-one symptoms of being an internet addict:
91. It's Saturday afternoon in the middle of May and you are on the computer.
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/