Re: nl_langinfo(CODESET) again
Edmund GRIMLEY EVANS wrote on 2000-05-24 08:23 UTC:
> does anyone have code for converting common nl_langinfo(CODESET) names into
> names that can be used in e-mail?
A pretty effective algorithm for normalizing character encoding names
into MIME charset names is attached below. Except for a very small
number of really widely established MIME encodings (namely US-ASCII and
ISO-8859-1), you are today far more likely to get your text correctly
displayed if you send it out in UTF-8 as opposed to whatever your local
legacy locale uses. For instance every Windows 2000 user is able to
display a very comprehensive Unicode repertoire, but not every Windows
2000 user has installed all the optional conversion tables from legacy
encodings. Only UTF-8 is guaranteed to be processable. The prudent thing
today is clearly to convert to UTF-8 on the sender's side, unless all
characters fit into US-ASCII or ISO-8859-1.
Do *not* send out ISO-8859-15 encoded MIME email, because orders of
magnitude fewer people will be able to display it correctly than
if you had sent it in UTF-8.
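The sender-side rule described above (US-ASCII if possible, else ISO-8859-1, else UTF-8) could be sketched roughly as follows. This is only an illustration, not part of the attached script: the helper names pick_mime_charset and encode_for_mail are hypothetical, and the input is assumed to be an already decoded Perl character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the narrowest of US-ASCII, ISO-8859-1
# and UTF-8 that can represent every character of $text.
# Assumption: $text is a decoded character string, not raw bytes.
sub pick_mime_charset {
    my ($text) = @_;
    return 'US-ASCII'   if $text !~ /[^\x00-\x7F]/;   # all code points < 128
    return 'ISO-8859-1' if $text !~ /[^\x00-\xFF]/;   # all code points < 256
    return 'UTF-8';                                   # everything else
}

# Hypothetical helper: return the charset label together with the
# bytes to put on the wire.
sub encode_for_mail {
    my ($text) = @_;
    my $charset = pick_mime_charset($text);
    return ($charset, encode($charset, $text));
}

# The euro sign (U+20AC) fits neither US-ASCII nor ISO-8859-1,
# so this message would be labelled and sent as UTF-8.
my ($charset, $bytes) = encode_for_mail("Price: \x{20AC}5");
print "Content-Type: text/plain; charset=$charset\n";
```

Note that ISO-8859-15 is deliberately never chosen here: text that needs the euro sign falls through to UTF-8.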
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
#!/usr/bin/perl
# Read the name of a character encoding from stdin and transform it into
# the corresponding standardized MIME charset name, as registered on
# (or pipelined for) ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
# Markus Kuhn <mkuhn@xxxxxxx> -- 2000-05-24
while (<STDIN>) {
    tr/a-z/A-Z/;    # compare case-insensitively
    if (/8859[-_\/](\d+)(:.*)?$/) {
        print "ISO-8859-$1\n";
    } elsif (/LATIN[-_\/]?([1-4])$/) {
        print "ISO-8859-$1\n";
    } elsif (/LATIN[-_\/]?CYRILLIC/) {
        print "ISO-8859-5\n";
    } elsif (/LATIN[-_\/]?ARABIC/) {
        print "ISO-8859-6\n";
    } elsif (/LATIN[-_\/]?GREEK/) {
        print "ISO-8859-7\n";
    } elsif (/LATIN[-_\/]?HEBREW/) {
        print "ISO-8859-8\n";
    } elsif (/LATIN[-_\/]?5$/) {
        print "ISO-8859-9\n";
    } elsif (/LATIN[-_\/]?6$/) {
        print "ISO-8859-10\n";
    } elsif (/LATIN[-_\/]?7$/) {
        print "ISO-8859-13\n";
    } elsif (/LATIN[-_\/]?8$/) {
        print "ISO-8859-14\n";
    } elsif (/LATIN[-_\/]?9$/) {
        print "ISO-8859-15\n";
    } elsif (/LATIN[-_\/]?10$/) {
        print "ISO-8859-16\n";
    } elsif (/UTF[-_\/]?8/ || /UTF$/) {
        print "UTF-8\n";
    } elsif (/(WINDOWS|WIN|CP|DOS|IBM|MSDOS)[-_\/]?(\d+)$/) {
        # Reliable for Windows code pages such as 1252 only; DOS/IBM code
        # pages like 437 or 850 are registered as IBM437, IBM850.
        print "windows-$2\n";
    } elsif (/ASCII/ || /X3\.4/ || /[^\d]646[-._\/]?IRV/) {
        # The hyphen comes first in the class so that it is a literal
        # and does not form a range.
        print "US-ASCII\n";
    } elsif (/2022[-_\/]?([A-Z\d-]+)(:.*)?$/) {
        print "ISO-2022-$1\n";
    } elsif (/SHIFT[-_\/]?JIS/) {
        print "Shift_JIS\n";
    } elsif (/BIG[-_\/]?5$/) {
        print "Big5\n";
    } else {
        print;      # unknown name: pass it through (upper-cased)
    }
}
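
To tie this back to the original question, a program could obtain the locale's codeset name from nl_langinfo(CODESET) and normalize it in-process instead of piping it through stdin. The sketch below assumes Perl's core I18N::Langinfo and POSIX modules are usable on the platform; the function name mime_charset_name is hypothetical, and it reproduces only three of the rules above for brevity.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical, abbreviated function form of the normalizer: only the
# ISO 8859, UTF-8 and US-ASCII rules are reproduced here.
sub mime_charset_name {
    my ($name) = @_;
    my $n = uc $name;                          # compare case-insensitively
    return "ISO-8859-$1" if $n =~ /8859[-_\/](\d+)(:.*)?$/;
    return 'UTF-8'       if $n =~ /UTF[-_\/]?8/;
    return 'US-ASCII'    if $n =~ /ASCII/ || $n =~ /X3\.4/;
    return $name;                              # unknown: return unchanged
}

# Adopt the user's locale from the environment, then ask for its codeset.
setlocale(LC_CTYPE, '');
my $codeset = langinfo(CODESET);
print mime_charset_name($codeset), "\n";
```

The output depends on the environment. For example, a codeset reported as ANSI_X3.4-1968 is normalized to US-ASCII, and ISO_8859-1:1987 to ISO-8859-1.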