
Re: filetype field?




PILCH Hartmut wrote:

> On Wed, 3 Nov 1999, Bram Moolenaar wrote:
>  
> > >  - The Unix kernel #!/bin/sh mechanism will break, because the
> > >    file will not start any more with #!
> > 
> > Good point.  Putting the BOM in the second line would work.  But that's a
> > bit strange.  It would be better to adjust the kernel to handle UTF-8
> > files, and thus ignore the BOM in this position.  Just one more place that
> > needs to be UTF-8 aware, not a big deal.
> 
> If the kernel is to look for a UTF-8 BOM, it might as well look for a
> general encoding marker.  That seems to be what you are using the BOM for.
> There is no byte order to be marked in UTF-8 texts, is there?

No, the suggestion is to use the BOM to mark a file as being a UTF-8 file.
There is no byte order problem in UTF-8, since it is always an ordered stream
of bytes.  The BOM is ignored otherwise.  The advantage of using the BOM to
mark a file as UTF-8 is that it's already in the Unicode standard, thus no
new "general encoding marker" byte sequence needs to be introduced.
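
To make the idea concrete, here is a minimal sketch in Python of what "using
the BOM as a marker" could look like.  The function name is made up for
illustration; the three-byte sequence EF BB BF is simply U+FEFF encoded in
UTF-8, available in Python as codecs.BOM_UTF8.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the UTF-8 encoded BOM (EF BB BF).

    Hypothetical helper, not part of Vim or any standard tool.
    """
    return data.startswith(codecs.BOM_UTF8)


with_bom = codecs.BOM_UTF8 + "hello\n".encode("utf-8")
without_bom = "hello\n".encode("utf-8")

print(starts_with_utf8_bom(with_bom))
print(starts_with_utf8_bom(without_bom))
```

Only the first three bytes need to be read, which is why this kind of check
is cheap compared to scanning the whole file.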

> If the kernel is to be changed, why not go to the roots and introduce an
> filetype field into the inode table, similar to the permissions field,
> with commands like
> 
> 	$ chft "text/plain; charset=utf-8" file1.txt 
> 	$ chft "text/plain; charset=iso-8859-1" file2.txt 
> 	$ chft "image/png" file.png
> 
> and a 
> 
> 	/etc/filetypes
> 
> table that associates mime types to code numbers in a
> tending-to-become-standardized way?

Well, that might work for Linux systems specifically.  Feel free to suggest it
to the people that work on it.  What I am looking for myself is a solution
that works on all platforms that Vim runs on, including Windows, Solaris, Mac,
OS/2, etc.  All these systems could have UTF-8 files that you would want to
edit with Vim.

> That could at least ensure that no BOMs are misplaced during 
> 
> 	$ cat file1.txt file2.txt > file.txt

For Linux, yes.  But the problems with networked file systems that have
already been mentioned remain.
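
For reference, the misplaced-BOM problem with cat can be sketched like this
in Python.  The file contents and the naive_cat helper are made up for
illustration; the point is only that plain byte-wise concatenation leaves the
second file's BOM in the middle of the result.

```python
import codecs

BOM = codecs.BOM_UTF8


def naive_cat(*chunks: bytes) -> bytes:
    """Concatenate byte strings the way 'cat' does: no encoding awareness."""
    return b"".join(chunks)


# Two hypothetical UTF-8 files, each starting with a BOM.
file1 = BOM + "one\n".encode("utf-8")
file2 = BOM + "two\n".encode("utf-8")

joined = naive_cat(file1, file2)

# "utf-8-sig" strips only a BOM at the very start of the data.
text = joined.decode("utf-8-sig")

# Any remaining U+FEFF is a BOM that ended up in the middle of the text.
stray = text.count("\ufeff")
print(stray)
```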

Again, introducing a BOM for UTF-8 files isn't without problems.  But leaving
out the BOM also has its problems, since it's then hard to know whether a file
is UTF-8 encoded.  Autodetection needs to be used, which slows down loading a
file and isn't 100% reliable.
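
A rough sketch of such autodetection in Python, assuming the simplest
possible heuristic: try to decode the whole file as UTF-8 and see if that
fails.  The function name and the returned labels are made up for
illustration; this is not how Vim does it.

```python
def guess_encoding(data: bytes) -> str:
    """Very rough guess based only on whether the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding; must look at every byte
    except UnicodeDecodeError:
        return "not-utf-8"
    if all(b < 0x80 for b in data):
        # Pure ASCII is valid UTF-8, but also valid in most 8-bit encodings.
        return "ascii"
    return "probably-utf-8"


utf8_sample = "caf\u00e9".encode("utf-8")
latin1_sample = "caf\u00e9".encode("latin-1")
# Latin-1 text that happens to form a valid UTF-8 sequence (C3 A9).
ambiguous_sample = "\u00c3\u00a9".encode("latin-1")

print(guess_encoding(utf8_sample))
print(guess_encoding(latin1_sample))
print(guess_encoding(ambiguous_sample))
```

The last sample shows both drawbacks: the whole file has to be scanned, and
an 8-bit file can still pass as UTF-8 by accident.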

One issue I just thought of: Isn't it true that a UTF-8 file can always
legally start with a BOM?  I mean, the standard does allow this, doesn't it?
Then all UTF-8 aware applications should be able to handle it correctly.  This
mostly means they ignore the BOM, and handle it like a non-printing,
zero-width character.  Perhaps this should be added to tests for UTF-8
compatibility.  Of course, this doesn't change anything for applications that
are _not_ UTF-8 aware and the problems they might have with a BOM.
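
As an illustration of "ignoring the BOM", here is a small Python sketch.  It
relies on Python's "utf-8-sig" codec, which drops a leading BOM when one is
present and otherwise behaves like plain UTF-8; the helper name is made up.

```python
import codecs


def read_text_ignoring_bom(data: bytes) -> str:
    """Decode UTF-8 data, silently dropping a BOM at the start if present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "#!/bin/sh\n".encode("utf-8")
without_bom = "#!/bin/sh\n".encode("utf-8")

# Both inputs decode to the same text.
print(read_text_ignoring_bom(with_bom) == read_text_ignoring_bom(without_bom))
```

A compatibility test along these lines would feed the same text with and
without a leading BOM and check that the application treats both alike.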

--
hundred-and-one symptoms of being an internet addict:
91. It's Saturday afternoon in the middle of may and you are on computer.

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/