a patch to pine4.44 for a better UTF-8(I18N) support
Hi,
A couple of months ago, I made a patch against Pine 4.44 to make it
better support UTF-8 and I18N in general. I've been using it under
xterm-16x under ko_KR.UTF-8 locale to send all my outgoing messages
in UTF-8 and read incoming messages in various encodings (ISO-8859-x,
Windows-125x, KOI8-R/U, ISO-2022-JP, EUC-KR, UTF-8, and so forth).
About 20% of the emails to this list archived on my local machine
were sent with some version of Pine, so I thought some of you
might be interested in my patch. The patch is available
at http://jshin.net/i18n/pine4.44.iconv.patch.
I have tested this only under Linux with glibc 2.2.x, but it should
also work under any Unix-like OS with Bruno's libiconv, or under any
other OS to which libiconv has been ported. My patch relies on the fact
that the glibc/libiconv implementation of iconv(3) does transliteration
when '//TRANSLIT' is added at the end of an encoding name. (This
dependency can be removed, but I was lazy.) The default iconv(3) on OSes
like Solaris 8/9 may not have this extension and won't work with my
patch. Since this is the linux-utf8 list, I guess I can get away with
that dependency here.
To compile it, you have to use
% ./build EXTRACFLAGS="-DHAVE_ICONV" target
Three configuration options are added. I got the idea for
two of them from Mutt 1.4.x/1.5.x.
* assumed-charset : a lot of emails sent by non-standards-compliant
MUAs/web mail programs have _raw_ 8-bit characters (i.e.
not encoded per RFC 2047) in the message header.
Setting this to the charset most common among such
emails helps you read their headers (subject, from,
to, etc). For instance, Western European users would
want to set this to ISO-8859-1/Windows-1252, and
Chinese (Simplified) users would set it to GB2312.
This does NOT work yet for an _untagged_ message body
(one with no MIME charset specified in the
Content-Type header). For an untagged message body,
you have to define the display filter for US-ASCII
as follows:
_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f ISO-8859-1 -t UTF-8
or
_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8
or
_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f GB2312 -t UTF-8
* charset-aliases : Some MUAs use non-standard MIME charset names. For
instance, MS Outlook Express uses ks_c_5601-1987
for EUC-KR or CP949 (X-Windows-949). You can
specify pairs of non-standard and standard MIME
charset names, with the pairs separated by commas.
Within each pair, the non-standard name and the
standard name are separated by a colon. For
instance, I have
ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949
* iconv-aliases : Iconv codeset names are not standardized
and are not always the same as
the standard MIME charset names. For instance,
'x-windows-949' is 'mscp949' in the glibc
implementation of iconv, so I have the following:
x-windows-949:mscp949,euc-kr:mscp949
Although EUC-KR is understood by the glibc
implementation of iconv, I also have
'euc-kr:mscp949'
because some emails in X-Windows-949 are MISLABELLED
as EUC-KR. X-Windows-949 (CP949) is upward
compatible with EUC-KR and there's no harm in
treating genuine EUC-KR text as X-Windows-949.
The same is true of ISO-8859-1 and Windows-1252:
'iso-8859-1:windows-1252' may be added to work
around the problem. You can get the
identical effect by adding it to the charset-aliases
list.
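
To illustrate how the three options could fit together, here is a
minimal Python sketch. It is not taken from the patch (which is written
in C). The function names parse_aliases, resolve_codeset and
decode_raw_header are hypothetical, and the case-insensitive matching,
the order in which the two alias tables are consulted, and the
ASCII-first fallback are assumptions made for the example.

```python
def parse_aliases(value):
    """Parse 'from:to,from:to' into a dict keyed by the lowercased 'from' name."""
    aliases = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        name, _, target = pair.partition(":")
        aliases[name.strip().lower()] = target.strip().lower()
    return aliases


def resolve_codeset(mime_charset, charset_aliases, iconv_aliases):
    """Map a MIME charset label to the codeset name that would be given to iconv."""
    name = mime_charset.strip().lower()
    # Assumed order: non-standard MIME name -> standard MIME name first ...
    name = charset_aliases.get(name, name)
    # ... then standard MIME name -> iconv codeset name.
    return iconv_aliases.get(name, name)


def decode_raw_header(raw, assumed_charset):
    """Decode an untagged header value that may contain raw 8-bit bytes."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Assumption: bytes invalid in the assumed charset are replaced, not fatal.
        return raw.decode(assumed_charset, errors="replace")


# Values copied from the examples given for the two alias options.
charset_aliases = parse_aliases("ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949")
iconv_aliases = parse_aliases("x-windows-949:mscp949,euc-kr:mscp949")

for label in ("KS_C_5601-1987", "EUC-KR", "UTF-8"):
    print(label, "->", resolve_codeset(label, charset_aliases, iconv_aliases))

# A raw ISO-8859-1 subject line, read with assumed-charset set to ISO-8859-1.
print(decode_raw_header("Grüße".encode("iso-8859-1"), "iso-8859-1"))
```

With these tables, both the Outlook Express label and a genuine EUC-KR
label end up as 'mscp949', while a label with no alias passes through
unchanged.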
You also have to set 'character-set' to 'UTF-8' and run Pine in a UTF-8
terminal (xterm-16x, putty, Solaris dtterm under a UTF-8 locale, etc).
In addition, you have to define a bunch of display filters because
my patch doesn't use iconv internally to do automatic encoding/MIME
charset conversion for the message body. However, it does automatic
conversion for the message header. I have the following defined
in my pinerc. I haven't checked yet whether the '-c' option is
specified in SUS3/POSIX. It may be a glibc/libiconv extension.
display-filters=_CHARSET(EUC-KR)_ /usr/bin/iconv -c -f EUC-KR -t UTF-8,
_CHARSET(ks_c_5601-1987)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
_CHARSET(ISO-8859-1)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8,
_CHARSET(ISO-8859-15)_ /usr/bin/iconv -c -f ISO8859-15 -t UTF-8,
_CHARSET(ISO-2022-JP)_ /usr/bin/iconv -c -f ISO-2022-JP -t UTF-8,
_CHARSET(GB2312)_ /usr/bin/iconv -c -f GB2312 -t UTF-8,
_CHARSET(BIG5)_ /usr/bin/iconv -c -f BIG5 -t UTF-8,
_CHARSET(Windows-1251)_ /usr/bin/iconv -c -f WINDOWS-1251 -t UTF-8,
_CHARSET(Windows-1252)_ /usr/bin/iconv -c -f WINDOWS-1252 -t UTF-8,
_CHARSET(Windows-1253)_ /usr/bin/iconv -c -f WINDOWS-1253 -t UTF-8,
_CHARSET(ISO-8859-2)_ /usr/bin/iconv -c -f ISO8859-2 -t UTF-8,
_CHARSET(ISO-8859-3)_ /usr/bin/iconv -c -f ISO8859-3 -t UTF-8,
_CHARSET(ISO-8859-4)_ /usr/bin/iconv -c -f ISO8859-4 -t UTF-8,
_CHARSET(ISO-8859-5)_ /usr/bin/iconv -c -f ISO8859-5 -t UTF-8,
_CHARSET(ISO-8859-6)_ /usr/bin/iconv -c -f ISO8859-6 -t UTF-8,
_CHARSET(ISO-8859-7)_ /usr/bin/iconv -c -f ISO8859-7 -t UTF-8,
_CHARSET(ISO-8859-8)_ /usr/bin/iconv -c -f ISO8859-8 -t UTF-8,
_CHARSET(ISO-8859-9)_ /usr/bin/iconv -c -f ISO8859-9 -t UTF-8,
_CHARSET(ISO-8859-10)_ /usr/bin/iconv -c -f ISO8859-10 -t UTF-8,
_CHARSET(ISO-8859-11)_ /usr/bin/iconv -c -f ISO8859-11 -t UTF-8,
_CHARSET(ISO-8859-13)_ /usr/bin/iconv -c -f ISO8859-13 -t UTF-8,
_CHARSET(ISO-8859-14)_ /usr/bin/iconv -c -f ISO8859-14 -t UTF-8,
_CHARSET(ISO-8859-16)_ /usr/bin/iconv -c -f ISO8859-16 -t UTF-8,
_CHARSET(KOI8-R)_ /usr/bin/iconv -c -f KOI8-R -t UTF-8,
_CHARSET(KOI8-U)_ /usr/bin/iconv -c -f KOI8-U -t UTF-8,
_CHARSET(Windows-874)_ /usr/bin/iconv -c -f CP874 -t UTF-8,
_CHARSET(UTF-7)_ /usr/bin/iconv -c -f UTF-7 -t UTF-8
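
For readers unfamiliar with what each of these filters does to a message
body, the following Python sketch mimics the conversion step. It is only
an approximation: the names convert_to_utf8 and filter_stream are
hypothetical, Python's codec names differ from iconv's (for example
'cp949' rather than 'MSCP949'), and errors="ignore" is assumed to be a
reasonable stand-in for iconv's '-c'.

```python
import io


def convert_to_utf8(data, source_charset):
    """Re-encode bytes from source_charset to UTF-8, skipping undecodable bytes."""
    # errors="ignore" drops offending bytes, roughly like 'iconv -c'.
    return data.decode(source_charset, errors="ignore").encode("utf-8")


def filter_stream(inp, out, source_charset):
    """Read a whole body from a binary stream and write it back out as UTF-8."""
    out.write(convert_to_utf8(inp.read(), source_charset))


# A body labelled EUC-KR, converted with the CP949 codec (a superset of EUC-KR).
body = io.BytesIO("한글 메일".encode("euc_kr"))
converted = io.BytesIO()
filter_stream(body, converted, "cp949")
print(converted.getvalue().decode("utf-8"))
```

Passing sys.stdin.buffer and sys.stdout.buffer to filter_stream would
turn the sketch into a stdin-to-stdout filter of the kind listed above.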
There are a couple of problems with my patch.
One of them is that I haven't done anything to fix the 'one octet ->
one column width' model. In UTF-8, this false assumption completely
breaks down except for characters in US-ASCII (U+0020 - U+007E), as you
are well aware. Therefore, in the message display screen, lines are
wrapped prematurely, and in the message index screen, headers (subject,
recipient, etc) are truncated prematurely.
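
The gap between octet count and column count can be seen with a short
Python sketch. The display_width function is a hypothetical,
simplified stand-in for wcwidth(3)-style logic: it assumes wide and
fullwidth characters take two columns, combining marks take none, and
everything else takes one, and it ignores control characters.

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks add no column of their own
        # East Asian Wide/Fullwidth characters occupy two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for sample in ("Pine", "한글", "Grüße"):
    octets = len(sample.encode("utf-8"))
    print(f"{sample!r}: {octets} octets, {display_width(sample)} columns")
```

Only the pure US-ASCII sample has matching octet and column counts,
which is why code that counts octets wraps and truncates too early.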
The other is that somehow the link to 'email list management
information' at the end of a message with a 'list management information'
header does not work. I guess it's easy to fix, but I haven't gotten
around to looking into it yet.
There may be other problems as well. I'll be glad to hear about them,
although I may not be able to fix them as quickly as I wish to.
BTW, Pine 4.44 with my patch can also be run under a non-UTF-8 terminal.
In that case, you have to set 'character-set' to the encoding of
your terminal (say, EUC-JP) and define your display filters accordingly.
My goal was to make Pine a text-terminal version of MS OE or
Mozilla-mail in terms of I18N support. With my patch, Pine got
closer to that goal, but is still far from it. Some of the features
I want to see include:
- The encoding(MIME charset) for outgoing emails should be
decoupled from the encoding of a terminal under which Pine
is launched.
- It should be possible to change the encoding(MIME charset)
of outgoing messages _at the time of_ composition
(as is possible with MS OE and Mozilla-Mail.)
Although going all the way to UTF-8 is desirable,
the reality is that some of my correspondents cannot
deal with UTF-8 messages. For them, I have to
write in legacy encodings. Currently, I have to
launch another Pine with a separate pinerc to compose
my email in a legacy encoding.
- Internal encoding conversion with iconv (as opposed to relying on
users setting display filters correctly in pinerc).
- 'assumed-charset' should be settable on a per-folder basis as well as
globally.
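
As a rough illustration of the first two wishes, the Python sketch below
picks the charset per message rather than from the terminal's encoding.
Nothing like it exists in the patch. The function name encode_outgoing
and the policy of falling back to UTF-8 when the chosen legacy charset
cannot represent the text are assumptions made for the example.

```python
def encode_outgoing(body, preferred_charset):
    """Return (charset, encoded bytes) for a message body held as Unicode text."""
    try:
        # Use the charset chosen at composition time if it can hold the text.
        return preferred_charset, body.encode(preferred_charset)
    except UnicodeEncodeError:
        # Assumed fallback: UTF-8 can represent any text.
        return "utf-8", body.encode("utf-8")


# A correspondent who can only read ISO-2022-JP gets a legacy-encoded message.
charset, payload = encode_outgoing("こんにちは", "iso-2022-jp")
print(charset, len(payload))

# Korean text does not fit in ISO-2022-JP, so the sketch falls back to UTF-8.
charset, payload = encode_outgoing("한글", "iso-2022-jp")
print(charset, len(payload))
```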
Hope a lot of people find my patch useful,
Jungshik Shin
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/