Re: nl_langinfo(CODESET) again



Edmund GRIMLEY EVANS wrote on 2000-05-24 08:23 UTC:
> does anyone have code for converting common nl_langinfo(CODESET) names into
> names that can be used in e-mail?

A pretty effective algorithm for normalizing character encoding names
into MIME charset names is attached below. Except for a very small
number of really widely established MIME encodings (namely US-ASCII and
ISO-8859-1), you are today far more likely to get your text correctly
displayed if you send it out in UTF-8 as opposed to whatever your local
legacy locale uses. For instance every Windows 2000 user is able to
display a very comprehensive Unicode repertoire, but not every Windows
2000 user has installed all the optional conversion tables from legacy
encodings. Only UTF-8 is guaranteed to be processable. The prudent thing
today is clearly to convert to UTF-8 on the sender's side, unless all
characters fit into US-ASCII or ISO-8859-1.

Do *not* send out ISO-8859-15 encoded MIME email: orders of magnitude
fewer people will be able to display it correctly than if you had sent
the same text in UTF-8.
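
The sender-side rule described above (US-ASCII or ISO-8859-1 when every
character fits, UTF-8 otherwise) might be sketched roughly as follows.
This is only an illustration: the subroutine names choose_charset and
encode_body are made up for this example, the text is assumed to be a
decoded Perl character string already, and the Encode module is assumed
to be available.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: pick the MIME charset for an outgoing body.
# $text must be a Perl character string (already decoded).
sub choose_charset {
    my ($text) = @_;
    return 'US-ASCII'   unless $text =~ /[^\x00-\x7F]/;  # all 7-bit
    return 'ISO-8859-1' unless $text =~ /[^\x00-\xFF]/;  # all Latin-1
    return 'UTF-8';                                      # everything else
}

# Hypothetical helper: return the charset label and the encoded bytes.
sub encode_body {
    my ($text) = @_;
    my $charset = choose_charset($text);
    return ($charset, encode($charset, $text));
}

# The euro sign is in neither US-ASCII nor ISO-8859-1, so this body
# goes out as UTF-8 rather than as ISO-8859-15.
my ($charset, $bytes) = encode_body("price: 10 \x{20AC}\n");
print "Content-Type: text/plain; charset=$charset\n";
```

The point of this design is that the locale's legacy encoding never
appears on the wire: it is only needed to decode the text before this
step.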

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

#!/usr/bin/perl
# Read the name of a character encoding from stdin and transform it into
# the corresponding standardized MIME charset name, as registered on
# (or pipelined for) ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
# Markus Kuhn <mkuhn@xxxxxxx> -- 2000-05-24

while (<STDIN>) {
    tr/a-z/A-Z/;
    if (/8859[-_\/](\d+)(:.*)?$/) {
	print "ISO-8859-$1\n";
    } elsif (/LATIN[-_\/]?([1-4])$/) {
	print "ISO-8859-$1\n";
    } elsif (/LATIN[-_\/]?CYRILLIC/) {
	print "ISO-8859-5\n";
    } elsif (/LATIN[-_\/]?ARABIC/) {
	print "ISO-8859-6\n";
    } elsif (/LATIN[-_\/]?GREEK/) {
	print "ISO-8859-7\n";
    } elsif (/LATIN[-_\/]?HEBREW/) {
	print "ISO-8859-8\n";
    } elsif (/LATIN[-_\/]?5$/) {
	print "ISO-8859-9\n";
    } elsif (/LATIN[-_\/]?6$/) {
	print "ISO-8859-10\n";
    } elsif (/LATIN[-_\/]?7$/) {
	print "ISO-8859-13\n";
    } elsif (/LATIN[-_\/]?8$/) {
	print "ISO-8859-14\n";
    } elsif (/LATIN[-_\/]?9$/) {
	print "ISO-8859-15\n";
    } elsif (/LATIN[-_\/]?10$/) {
	print "ISO-8859-16\n";
    } elsif (/UTF[-_\/]?8/ || /UTF$/) {
	print "UTF-8\n";
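    } elsif (/(WINDOWS|WIN|CP|DOS|IBM|MSDOS)[-_\/]?(\d+)$/) {
	# Only code pages 125x and 874 are registered as windows-*; the
	# DOS/IBM code pages (437, 850, ...) are registered as IBMnnn.
	my $cp = $2;
	print $cp =~ /^(?:125\d|874)$/ ? "windows-$cp\n" : "IBM$cp\n";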
    } elsif (/ASCII/ || /X3\.4/ || /[^\d]646[-._\/]?IRV/) {
	print "US-ASCII\n";
    } elsif (/2022[-_\/]?([A-Z\d-]+)(:.*)?$/) {
	print "ISO-2022-$1\n";
    } elsif (/SHIFT[-_\/]?JIS/) {
	print "Shift_JIS\n";
    } elsif (/BIG[-_\/]?5$/) {
	print "Big5\n";
    } else {
	print;
    }
}
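
For the original question, here is a rough sketch of how the
nl_langinfo(CODESET) value could be obtained from within Perl and passed
through rules like the ones above. It assumes the I18N::Langinfo and
POSIX modules are available. The subroutine mime_charset is a
hypothetical, cut-down version of the script (only the ISO 8859, UTF-8
and ASCII rules), and unlike the script it returns unknown names
unchanged instead of in upper case.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: a small subset of the normalization rules,
# wrapped in a subroutine instead of a stdin filter.
sub mime_charset {
    my ($name) = @_;
    my $n = uc $name;
    return "ISO-8859-$1" if $n =~ /8859[-_\/](\d+)(:.*)?$/;
    return 'UTF-8'       if $n =~ /UTF[-_\/]?8/;
    return 'US-ASCII'    if $n =~ /ASCII/ || $n =~ /X3\.4/;
    return $name;        # unknown name: pass it through unchanged
}

setlocale(LC_CTYPE, '');           # adopt the locale from the environment
my $codeset = langinfo(CODESET);   # Perl's interface to nl_langinfo(CODESET)
print mime_charset($codeset), "\n";
```

What langinfo(CODESET) returns depends on the platform and the current
locale, so the printed name will differ from system to system.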