
ucat



I find two small scripts I wrote very helpful for migrating to UTF-8:

  $ textcoding file

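prints the coding system of the file, if it is declared using the Emacs
file-variables convention: either `-*- coding: NAME -*-' on the first line or
a `coding: NAME' entry in the local-variables block near the end of the file.
(This convention is meant to be adopted by all editors and text utilities.)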
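
For illustration, the two usual places for such a declaration can be shown
with a pair of throwaway files.  This is only a sketch: the file names and the
show_coding helper are made up for the example, and its sed pattern is a
simplified stand-in for what textcoding does, not the script itself.

```shell
#!/bin/bash
# Sketch: the two usual places for an Emacs coding declaration.
# File names and the show_coding helper are hypothetical.

show_coding () {
  # Print the first "coding: NAME" found on the first line or in the
  # last 20 lines of the file; print nothing if there is none.
  { head -n 1 "$1"; tail -n 20 "$1"; } |
    sed -n 's/.*coding: \([a-z0-9-]*\).*/\1/p' | head -n 1
}

DIR=$(mktemp -d)

# 1. On the first line, between -*- markers.
printf '%s\n' '# -*- coding: iso-8859-1 -*-' 'some text' > "$DIR/head.txt"

# 2. In a local-variables block near the end of the file.
printf '%s\n' 'some text' '# Local Variables:' '# coding: euc-jp' '# End:' \
  > "$DIR/tail.txt"

show_coding "$DIR/head.txt"   # prints iso-8859-1
show_coding "$DIR/tail.txt"   # prints euc-jp

rm -rf "$DIR"
```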
 
  $ ucat file1 [file2 [ ...]]

uses textcoding to find out each file's encoding, converts each file to UTF-8
(updating its coding declaration to match), and writes the result to stdout.
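
For an encoding that iconv itself knows, the core of that step can be sketched
as a single pipeline: convert the bytes, then rewrite the declaration so that
it matches the new encoding.  The to_utf8 name is hypothetical, and the sketch
assumes an iconv command on the PATH that supports the named encoding; it
leaves out the coco/utf2mule route entirely.

```shell
#!/bin/bash
# Sketch of the convert-and-relabel step for iconv-supported encodings.
# to_utf8 is a hypothetical helper, not part of ucat itself.

to_utf8 () {
  # $1 = source encoding as named in the file; stdin = file contents
  iconv -f "$1" -t utf-8 | sed "s/coding: $1/coding: utf-8/"
}

# \351 is the ISO-8859-1 byte for e-acute (0xE9), written as an octal escape.
printf '# -*- coding: iso-8859-1 -*-\ncaf\351\n' | to_utf8 iso-8859-1
```

The first output line then reads `# -*- coding: utf-8 -*-', and the accented
letter on the second line comes out as the two-byte UTF-8 sequence.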

`ucat' should probably be written using Bruno Haible's iconv library, but I
wrote mine as a collection of scripts.  `textcoding' should imho also be a
C program, but I am better at writing dirty scripts.  Also, the last version
of the iconv library I looked at did not yet support the iso-2022-7bit-ss2 and
emacs-mule coding systems, so I have to use coco, which is in the Mule 2.3
package.

#!/bin/bash
# TEXTCODING: read out the coding system of a text file
# if it is specified according to the Emacs convention
# Otherwise return an error

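FILE=$1

fail () { echo "$@" >&2; return 1; }

# Look for "-*- ... coding: NAME ... -*-" on the first line.
headgetcoding () { expr "$(head -n 1 "$1")" : ".*-\*-.*\<coding: \([a-z0-9-]*\)\>.*-\*-"; }

# Look for "coding: NAME" in the last 20 lines (the local-variables block).
# LINE is declared separately so that `local' does not hide grep's exit status.
tailgetcoding () {
  local LINE
  LINE=$(tail -n 20 "$1" | grep "coding:") && expr "$LINE" : ".*\<coding: \([a-z0-9-]*\)\>"
}

{ test -f "$FILE" || fail "can't find $FILE"; } &&
{ CODING=$(headgetcoding "$FILE") || CODING=$(tailgetcoding "$FILE"); } &&
echo "$CODING"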

#!/bin/bash
usage () { cat << EOF 
UCAT
like cat, but read out coding from input files and convert to utf-8
uses coco from mule-2.3, Otfried Cheung's utf2mule and iconv

examples:
 
	ucat < text.i27 > text.u8
	ucat text1.i27 text2.u8 text3.i51

EOF
}

exitmsg () { local E=$?;if test $E = 0;then echo OK >&2;else { echo KO;echo -e "$MSG";usage; } >&2;fi;exit $E; }

fatal () { local E=$1;shift;MSG="$0: $@";exit $E; }

trap exitmsg EXIT;

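# Converters: each reads stdin and writes UTF-8 to stdout.
mule2utf () { utf2mule -i; }
coco2utf () { coco "*$1*" '*internal*' | mule2utf; }
iconvutf () { iconv -f "$1" -t utf-8; }

# With no arguments, spool stdin to a temporary file so textcoding can read it.
test $# = 0 && { CATFILE=/tmp/ucat.$$; cat > "$CATFILE"; set -- "$CATFILE"; }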

for FILE;do

  CODING0=$(textcoding "$FILE") || fatal 10 "no coding found for file $FILE"
  CODING=$CODING0

  case $CODING in
    (internal|emacs-mule|mule) CONV=mule2utf;;
    (junet|iso-2022-7bit*) CODING=junet;CONV=coco2utf;;
    (autoconv) CONV=coco2utf;;
    (*) case $CODING in
	 (euc-china|cn-gb2312) CODING=euc-cn;;
	 (euc-japan) CODING=euc-jp;;
	 (euc-korea) CODING=euc-kr;;
	 (cn-big5) CODING=big5;;
	esac;
      CONV=iconvutf;;
   esac;

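  # A pipeline's exit status is that of sed alone, so check both stages.
  $CONV "$CODING" < "$FILE" | sed "s/coding: $CODING0/coding: utf-8/"
  test "${PIPESTATUS[*]}" = "0 0" || fatal 3 "failed to convert $FILE from $CODING to utf-8"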
 done;

rm -f $CATFILE
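
The alias handling in the inner case statement could also live in a small
function of its own, which makes it easy to check in isolation.  A sketch,
with iconv_name as a made-up helper name; the table holds only the aliases
already listed in the script above:

```shell
#!/bin/bash
# Sketch: map Emacs/Mule coding-system names to the names iconv expects.
# iconv_name is a hypothetical helper; unknown names pass through unchanged.

iconv_name () {
  case $1 in
    euc-china|cn-gb2312) echo euc-cn ;;
    euc-japan)           echo euc-jp ;;
    euc-korea)           echo euc-kr ;;
    cn-big5)             echo big5 ;;
    *)                   echo "$1" ;;
  esac
}

iconv_name euc-japan    # prints euc-jp
iconv_name iso-8859-1   # prints iso-8859-1
```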

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/