Re: ucat



I would like to propose that some efforts be made to officially elevate
two conventions which so far have been more or less confined to the Emacs
world, to more broadly adoptable standards.

(1) The coding systems iso-2022-7bit and emacs-mule
(2) The file variable mechanism

We discussed this and I am still not convinced.

Bruno Haible wrote:

> > in the last iconvlib
> > I saw the iso-2022-7bit-ss2 and emacs-mule coding system were not supported
> 
> 'iso-2022-7bit-ss2' and 'emacs-mule' are by no means standardized
> abbreviations. The only programs that understand them are emacs and xemacs.
> If a user wishes to operate with other programs than Emacs, he should use
> standardized names. For example, your 'iso-2022-7bit-ss2' files are probably
> ISO-2022-CN or ISO-2022-CN-EXT or ISO-2022-JP-2. These encodings are supported
> by libiconv, and (mostly) also by glibc's iconv, Microsoft browsers etc.

If I treat my iso-2022-7bit files as iso-2022-jp-2 (the closest of those
listed above), a lot of data is lost: characters from other sets (e.g.
cn-big5) simply cannot be encoded in iso-2022-jp-2.
 
iso-2022-7bit and emacs-mule are more powerful than any of the 'standard'
codings listed above.  They are well documented, and in practice they are
the most important systems for multilingual work on the GNU/Linux
platform.  Unfortunately, multilingual work on this platform still means
work within Emacs.

As far as I can see, iso-2022-7bit cannot even be completely replaced by
UTF-8.  Some programs, like my pquail input systems, need to distinguish
which source character set a Han character belongs to.  With UTF-8, the
famous round-trip conversion breaks in this case.

For example, I have a file with Han docstrings that are to be displayed in
various coding environments.  With UTF-8, these are unified into one form,
and it is impossible to convert them back to a form that distinguishes
which string belongs to which coding system.
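
To make the unification problem concrete, here is a minimal sketch in
Python (not scsh, and not part of pquail).  It assumes only Python's
built-in codecs; the names SAMPLE, legacy_forms and unified are made up
for this illustration.  One Han character starts out as three different
byte sequences in three legacy sets and ends up as a single UTF-8
sequence, so the information about its origin is gone.

```python
# Sketch: Han unification discards the source character set.
# Assumption: Python's built-in codecs stand in for the legacy sets.

SAMPLE = "中"  # a Han character present in GB 2312, Big5 and JIS X 0208


def legacy_forms(ch):
    """Return the byte form of ch in three legacy encodings."""
    return {name: ch.encode(name) for name in ("gb2312", "big5", "euc_jp")}


def unified(forms):
    """Decode each legacy form to Unicode and re-encode it as UTF-8."""
    return {name: raw.decode(name).encode("utf-8")
            for name, raw in forms.items()}


forms = legacy_forms(SAMPLE)
utf8_forms = unified(forms)

# Number of distinct byte sequences before and after the conversion.
print(len(set(forms.values())))
print(len(set(utf8_forms.values())))
```

After the conversion nothing in the UTF-8 bytes says which of the three
sets a given string came from, whereas an ISO 2022 stream records it in
the designating escape sequence.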

This is a very rare case, and it is only due to the fact that I want to
specifically support a certain set of legacy coding systems.  Yet this
case does occur, and it can be dealt with adequately only with
iso-2022-7bit.
 
> > the coding of the file, if it is specified using the Emacs
> > local-variables convention (which is meant to be adopted by all editors and
> > text utilities)
> 
> I beg to disagree. It is part of the Emacs philosophy that Emacs has the
> right to introduce its own conventions, like the -*-XXX-*- line in a file,
> because "all a user needs is Emacs". This does not mean that every other
> program has to follow and copy Emacs inventions. It only means that people
> which rely on these Emacs inventions to work cannot use standard utilities
> and programs.

Which standard utilities and programs?
There is no 'standard' that performs what the above-mentioned Emacs
convention performs.

In the absence of official standards, a well-designed and well-documented
convention that does the work takes the place of a standard.

Btw, I have written a faster and more accurate version of the 'textcoding'
script, this time in Scheme Shell.  You will find it appended below.

I tested it on all kinds of conforming and non-conforming (even binary)
files, and it seems to produce usable results.
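
For readers without scsh, the lookup can also be sketched in Python.
This is only an illustration of the two forms of the file variables
convention (a "-*- ... -*-" first line, or a "Local Variables:" list
near the end of the file), not a port of the script.  The function name
detect_coding and the pattern for a coding name are assumptions made
here; the 3000-character window follows the Emacs manual.

```python
import re

# Assumption: a coding name is a run of letters, digits, ".", "_" or "-".
CODING_RE = re.compile(rb"coding:\s*([A-Za-z0-9._-]+)")
HEAD_RE = re.compile(rb"-\*-(.*?)-\*-")
TAIL_WINDOW = 3000  # the list must start within the last 3000 characters


def detect_coding(data):
    """Return the coding declared in data (bytes) as a str, or None."""
    # Form 1: "-*- ... coding: NAME ... -*-" on the first line.
    first_line = data.split(b"\n", 1)[0]
    head = HEAD_RE.search(first_line)
    if head:
        m = CODING_RE.search(head.group(1))
        if m:
            return m.group(1).decode("ascii")
    # Form 2: a "Local Variables:" list near the end of the file.
    tail = data[-TAIL_WINDOW:]
    start = tail.rfind(b"Local Variables:")
    if start != -1:
        m = CODING_RE.search(tail, start)
        if m:
            return m.group(1).decode("ascii")
    return None


print(detect_coding(b"; -*- mode: scheme; coding: iso-2022-7bit -*-\n"))
```

Working on bytes rather than decoded text is deliberate: the coding is
not yet known at the point where the declaration is read.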

--
phm
#!/usr/local/bin/scsh -s
!#
; find out the coding of a textfile that observes the Emacs File Variables convention
; (see the chapter on 'file variables' in the Emacs Info manual)

(define displine (lambda (str) (display (string-append str (string #\newline)))))
(define fatal (lambda (err str) (displine str) (exit err)))

(and (null? command-line-arguments) (fatal 5 "need 1 argument"))

(define file (car command-line-arguments))

(or (file-readable? file) (fatal 4 "no such file"))

(define textfile? (lambda (f) (regexp-search (rx "text") (run/string (file ,f)))))

(or (textfile? file) (fatal 3 "not even a text file"))

(define size (file-size file))

; no upper bound on the name length: "iso-2022-7bit" alone is 13 characters
(define codingre (rx (: bow "coding: " (submatch (+ (| alphanum "-"))) eow)))

(define port (open-input-file file))
(define match #f)

; look for "-*- ... coding: NAME ... -*-" on the first line of the file
(define headgetcoding 
  (lambda () (let ((line (read-line port)))
    ; an empty file yields the eof object instead of a string
    (set! match (and (string? line)
                     (regexp-search (rx (: "-*- " (submatch (+ any)) " -*-")) line)
                     (regexp-search codingre line)))
    match)))

(define eofsearch 
  (lambda (re) 
    (let loop ((line (read-line port)))
      (cond
       ((not (string? line)) #f)
       ((begin (set! match (regexp-search re line)) match) #t)
       (else (loop (read-line port)))
       ) ) ) )

(define tailgetcoding 
  (lambda ()
    ; Emacs expects the list to start within the last 3000 characters
    (seek port (max 0 (- size 3000)))
    (and (eofsearch (rx "Local Variables:")) (eofsearch codingre))
    ) )

(if (or (headgetcoding) (tailgetcoding)) 
  (displine (match:substring match 1))
  (displine "no coding found") 
)

(close port)

(exit (if match 0 1))

; Local Variables:
; coding: utf-8
; mode: scheme
; End: