
C-Kermit + Unicode



C-Kermit (in case you have never heard of it) is a cross-platform,
transport-independent, scriptable communications program, written
in C, from the Kermit Project at Columbia University:

  http://www.columbia.edu/kermit/

C-Kermit 7.0 Beta.10 for UNIX (including Linux), plus VMS, Plan 9,
and AOS/VS was released a few days ago:

  http://www.columbia.edu/kermit/ck70.html

The main addition since Beta.09 (and hopefully the last major
addition before the final 7.0 release) is Unicode support.

Kermit protocol and software have included character-set
translation capabilities since the 1980s, allowing conversion of
text among the many "traditional" character sets like the ISO 8859
Latin Alphabets, PC code pages, IBM mainframe EBCDIC code pages,
ISO 646 national character sets, KOI sets, JIS sets, and assorted
proprietary sets (DEC, DG, Apple, NeXT, etc.).  C-Kermit 7.0 adds
Unicode to the list:

 . UCS-2 and UTF-8 are now supported as transfer character sets.
   (Transfer character sets are the small number of international
   standard character sets allowed "on the wire" in Kermit file
   transfer; each file-transfer partner converts between its local
   encoding and the transfer encoding.  UCS-2 and UTF-8 are two
   different representations of Unicode / ISO 10646.)

(You might ask why UCS-2 is allowed as a transfer character set --
why not stick with UTF-8?  It's because CJK text is more compact
in UCS-2: two bytes per character, versus three in UTF-8.)
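
The size difference is easy to check outside Kermit.  The following is
a minimal Python sketch, not C-Kermit code: the helper name
encoded_sizes is made up for illustration, and it assumes text confined
to the Basic Multilingual Plane, where Python's "utf-16-be" codec
produces the same bytes as big-endian UCS-2 without a BOM.

```python
def encoded_sizes(text):
    """Return (ucs2_bytes, utf8_bytes) for text confined to the BMP."""
    # UCS-2 has no way to represent code points above U+FFFF.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent characters outside the BMP")
    # For BMP-only text, UTF-16 and UCS-2 are byte-for-byte identical.
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


# Eight Japanese characters (kanji, hiragana, katakana), written as escapes.
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
english = "plain text"  # ten ASCII characters

print(encoded_sizes(japanese))  # (16, 24)
print(encoded_sizes(english))   # (20, 10)
```

The same comparison shows the other side of the trade-off: for text
that is mostly ASCII, UTF-8 is the smaller of the two.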

 . UCS-2 and UTF-8 are now supported as file character sets.
   Incoming text can be stored in either UTF-8 or UCS-2, and
   UCS-2 or UTF-8 text can be sent with conversion to any
   appropriate transfer character set (including conversion of
   UCS-2 to UTF-8 or vice-versa).  UCS-2 BOMs are handled as
   they should be, so "wrong-ended" UCS-2 files are still
   interpreted and sent correctly.  Incoming files, when stored
   as UCS-2, are given the appropriate BOM (unless you specify
   otherwise).

 . C-Kermit's TRANSLATE command can be used to convert
   traditional files to UCS-2 or UTF-8 (and, to the degree
   possible, vice versa) on the local computer, as well as
   between UCS-2 and UTF-8.

 . C-Kermit can conduct UTF-8 terminal sessions, even when its
   local character set is not Unicode.  (It is also programmed
   to do the reverse, i.e. to make connections from a UTF-8
   console or window to a non-Unicode host, but I haven't been
   able to test this.  In theory, you should be able to use
   C-Kermit in a UTF-8 xterm window to make a connection to
   (say) a Latin-1 host, and have C-Kermit take care of all
   the conversion back and forth.)

 . C-Kermit's TRANSMIT command can perform "ASCII" (nonprotocol)
   uploads of text files, converting them to UTF-8 on the fly.
   It can also upload UTF-8 or UCS-2 text, converting it to
   some other character set on the fly.
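
To make the BOM handling mentioned in the file character set item
concrete, here is a rough Python sketch of the general idea.  It is an
illustration only, not how C-Kermit is implemented: the function names
and the big-endian default are assumptions, and the text is assumed to
stay within the BMP so that Python's UTF-16 codecs behave like UCS-2.

```python
import codecs


def decode_ucs2(data, default_big_endian=True):
    """Decode UCS-2 bytes, honouring a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to an assumed byte order (hypothetical default).
    return data.decode("utf-16-be" if default_big_endian else "utf-16-le")


def encode_ucs2(text, with_bom=True):
    """Encode text as big-endian UCS-2, optionally prefixed with a BOM."""
    body = text.encode("utf-16-be")
    return codecs.BOM_UTF16_BE + body if with_bom else body


big_endian = b"\xfe\xff\x00H\x00i"
little_endian = b"\xff\xfeH\x00i\x00"   # a "wrong-ended" file on a big-endian host

print(decode_ucs2(big_endian))      # Hi
print(decode_ucs2(little_endian))   # Hi
```

Both inputs decode to the same text because the BOM, not the host's
native byte order, decides how the remaining bytes are read.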

(Obviously whenever translating from Unicode to a smaller set,
Unicode characters that are not in the smaller set are lost, just 
like when converting from, say, Latin-1 to German ISO 646.)
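
The loss described above can be shown in a few lines of Python.  This
is a sketch under stated assumptions rather than Kermit's actual
translation tables: the helper to_smaller_set and the "?" substitute
are invented for the example, and Latin-1 and ASCII stand in for the
smaller sets because Python's standard library has no German ISO 646
codec.

```python
def to_smaller_set(text, encoding, substitute="?"):
    """Encode text into a smaller set; return (bytes, characters lost)."""
    out = bytearray()
    lost = []
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # The target set has no code for this character.
            lost.append(ch)
            out += substitute.encode(encoding)
    return bytes(out), lost


# "Gruesse" spelled with u-umlaut and sharp s, followed by a Greek Omega.
sample = "Gr\u00fc\u00dfe, \u03a9"

data, lost = to_smaller_set(sample, "latin-1")
print(data)        # b'Gr\xfc\xdfe, ?'
print(len(lost))   # 1 (only the Omega is missing from Latin-1)

data, lost = to_smaller_set(sample, "ascii")
print(data)        # b'Gr??e, ?'
print(len(lost))   # 3
```

A real translator may choose a closer approximation than "?" for some
characters, but anything without a counterpart in the target set
cannot survive the round trip.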

C-Kermit 7.0 handles Unicode at ISO 10646 "Level 1" (roughly
equivalent to Unicode Normalization Form C), meaning there is no
particular support for combining characters (nor, for that matter,
for nonzero planes).  My initial thought was that the cost of a
database lookup and potential recursive canonical (de)composition
per character is a rather high price to pay in a telecommunications
application for a feature (character composition) that is not used
in Plan 9 and probably not in Linux either -- but I could be wrong!

(For example, some Windows NT applications might perform canonical
decompositions when storing Unicode textual data -- I don't know --
which would cause problems when transferring these files to
platforms that support Unicode but not combining characters,
unless the transfer agent also converted to Normalization Form C.)
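
For readers unfamiliar with the distinction, the following Python
sketch shows what canonical decomposition and recomposition do to a
single accented letter.  It uses Python's standard unicodedata module
and is not part of C-Kermit.  The helper has_combining_marks is a
hypothetical check that only approximates the Level 1 restriction, by
looking for characters with a nonzero canonical combining class.

```python
import unicodedata

composed = "\u00e9"                                   # e with acute, one code point
decomposed = unicodedata.normalize("NFD", composed)   # "e" followed by U+0301


def to_nfc(text):
    """Recompose text into Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def has_combining_marks(text):
    """True if any character has a nonzero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


print(len(composed), len(decomposed))            # 1 2
print(has_combining_marks(decomposed))           # True
print(to_nfc(decomposed) == composed)            # True
print(has_combining_marks(to_nfc(decomposed)))   # False
```

The table lookups behind normalize() are the per-character cost
referred to above; a transfer agent that recomposed text would pay it
for every character it handled.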

The Web page lists all the other new features since the previous
release, 6.0, in September 1996.  Beta.10 has already been built
successfully on more than 130 different platforms (prebuilt
binaries are available and are listed at the end of the Web page;
if you can build others, please let me know).  Until a new edition
of the C-Kermit manual is published, the new features of version
7.0 are documented in the (plain text) ckermit2.txt file; Section
6.6 describes the new Unicode features:

  ftp://kermit.columbia.edu/kermit/test/text/ckermit2.txt

Comments and questions, especially on the new Unicode features,
are welcome.

Frank da Cruz
The Kermit Project
Columbia University
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/