ucat
I find two small scripts I wrote very helpful for migrating to UTF-8:
$ textcoding file
prints the coding system of the file, if it is specified using the Emacs
local-variables convention (which is meant to be adopted by all editors and
text utilities).
$ ucat file1 [file2 [ ...]]
uses textcoding to find out each file's encoding, converts the files to
UTF-8, and writes them to stdout.
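For readers who have not met the convention, here is a minimal sketch of the
two places where a coding declaration can appear: a "-*- ... -*-" cookie on
the first line, or a "Local Variables:" block near the end of the file. The
sample files and the `cookie' helper are made up for illustration; the sed
expression is a simplified stand-in, not the textcoding script itself.

```shell
#!/bin/bash
# Hypothetical sample files, created in a scratch directory for illustration.
DIR=$(mktemp -d)

# Form 1: a "-*- ... -*-" cookie on the first line.
printf '%s\n' '# -*- coding: iso-8859-1 -*-' 'some text' > "$DIR/head.txt"

# Form 2: a "Local Variables:" block near the end of the file.
printf '%s\n' 'some text' '# Local Variables:' '# coding: euc-jp' '# End:' > "$DIR/tail.txt"

# Simplified extraction: print the name that follows "coding: "
# on the first line that contains such a declaration.
cookie () { sed -n 's/.*coding: \([a-z0-9-]*\).*/\1/p' "$1" | head -n 1; }

cookie "$DIR/head.txt"   # prints iso-8859-1
cookie "$DIR/tail.txt"   # prints euc-jp

rm -rf "$DIR"
```

Unlike textcoding, this sketch scans the whole file rather than only the
first line and the last 20 lines.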
`ucat' should probably be written using Bruno Haible's iconv library, but I
wrote mine as a collection of scripts. `textcoding' should IMHO also be a
C program, but I am better at writing dirty scripts. Also, the last version
of the iconv library I looked at did not yet support the iso-2022-7bit-ss2
and emacs-mule coding systems, so I have to use coco, which is in the
Mule 2.3 package.
#!/bin/bash
# TEXTCODING: print the coding system of a text file
# if it is specified according to the Emacs convention
# (a "-*- coding: NAME -*-" first line, or "coding: NAME" in the last 20 lines).
# Otherwise return an error
FILE=$1
fail () { echo "$@" >&2; return 1; }
# expr prints an empty line when nothing matches, so echo the name only on success
headgetcoding () { local C; C=$(expr "$(head -n 1 "$FILE")" : '.*-\*-.*\<coding: \([a-z0-9-]*\)\>.*-\*-') && echo "$C"; }
tailgetcoding () { local LINE C; LINE=$(tail -n 20 "$FILE" | grep "coding:") && C=$(expr "$LINE" : '.*\<coding: \([a-z0-9-]*\)\>') && echo "$C"; }
{ test -f "$FILE" || fail "can't find $FILE"; } &&
CODING=$(headgetcoding || tailgetcoding) &&
echo "$CODING"
#!/bin/bash
usage () { cat << EOF
UCAT
like cat, but read out coding from input files and convert to utf-8
uses coco from mule-2.3, Otfried Cheong's utf2mule and iconv
examples:
ucat < text.i27 > text.u8
ucat text1.i27 text2.u8 text3.i51
EOF
}
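# on exit, report OK or KO (plus the error message and usage) on stderr
exitmsg () { local E=$?;if test $E = 0;then echo OK >&2;else { echo KO;echo -e "$MSG";usage; } >&2;fi;exit $E; }
fatal () { local E=$1;shift;MSG="$0: $*";exit $E; }
trap exitmsg EXIT;
# converters: each one filters stdin to UTF-8 on stdout
mule2utf () { utf2mule -i; }
coco2utf () { coco "*$1*" '*internal*' | mule2utf; }
iconvutf () { iconv -f "$1" -t utf-8; }
# with no arguments, save stdin to a temporary file so textcoding can inspect it
test $# = 0 && { CATFILE=$(mktemp /tmp/ucat.XXXXXX) || fatal 1 "cannot create temporary file";cat > "$CATFILE";set -- "$CATFILE"; }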
set -o pipefail # a failing converter, not only sed, must fail the pipeline
for FILE;do
CODING0=$(textcoding "$FILE") || fatal 10 "no coding found for file $FILE"
CODING=$CODING0
# pick a converter; for iconv, map Emacs coding names to iconv's names
case $CODING in
(internal|emacs-mule|mule) CONV=mule2utf;;
(junet|iso-2022-7bit*) CODING=junet;CONV=coco2utf;;
(autoconv) CONV=coco2utf;;
(*) case $CODING in
(euc-china|cn-gb2312) CODING=euc-cn;;
(euc-japan) CODING=euc-jp;;
(euc-korea) CODING=euc-kr;;
(cn-big5) CODING=big5;;
esac;
CONV=iconvutf;;
esac;
# convert, then rewrite the coding declaration so the output describes itself
$CONV "$CODING" < "$FILE" | sed "s/coding: $CODING0/coding: utf-8/" || fatal 3 "failed to convert $FILE from $CODING to utf-8";
done;
rm -f $CATFILE
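
The iconv branch of ucat can be tried in isolation. What follows is a sketch
under assumptions, not a replacement for the script: `iconvname' and
`to_utf8' are hypothetical helper names, the name table is copied from the
case statement above, and the sample input is made up. It needs only iconv
and sed, not coco or utf2mule.

```shell
#!/bin/bash
# Map an Emacs coding name to the name iconv expects (same table as in ucat).
iconvname () {
  case $1 in
    (euc-china|cn-gb2312) echo euc-cn;;
    (euc-japan) echo euc-jp;;
    (euc-korea) echo euc-kr;;
    (cn-big5) echo big5;;
    (*) echo "$1";;
  esac
}

# to_utf8 EMACSNAME: filter stdin to UTF-8 and rewrite the coding declaration.
to_utf8 () { iconv -f "$(iconvname "$1")" -t utf-8 | sed "s/coding: $1/coding: utf-8/"; }

# Hypothetical sample: ISO-8859-1 text containing e-acute (octal 351).
printf '# -*- coding: iso-8859-1 -*-\ncaf\351\n' | to_utf8 iso-8859-1
```

The first output line should now declare utf-8, and the e-acute should come
out as the two-byte UTF-8 sequence (octal 303 251).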
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/