Re: Comments on ISO PDTR 14652
Keld Simonsen wrote on 2000-10-12 17:42 UTC:
> The current status of the draft is that it exists a PDTR draft available
> via http://www.dkuug.dk/jtc1/sc22/WG20/docs/projects
Thanks!
Some comments on ISO PDTR 14652 from
http://anubis.dkuug.dk/jtc1/sc22/WG20/docs/n690.pdf
a) In line 1603, please make clear that the repertoire map is optional. In
practical implementations such as glibc 2.2, no repertoire maps
will be used any more. All characters will be defined exclusively in
the form <Uxxxx>. Repertoire maps are an archaic and obsolete pre-UCS
concept that should never lead to mandatory elements of the syntax
anywhere. Strings in locales should either be specified in <Uxxxx>
notation for maximum portability, or in UTF-8 for maximum readability.
Repertoire maps have nothing practically useful to add to these two
options.
b) In section 4.3.2.3, the description of the semantics of keywords
"default_missing" and "translit_ignore" is incomplete, ambiguous
and confusing. I haven't understood what "translit_ignore" is good
for. Please don't explain it to me, instead rewrite the document such
that there can be no doubt for me how I have to implement this.
c) In section 4.3.2, there is at the moment no description of a proper
step-by-step algorithm for how transliteration has to be performed
according to the data supplied in these keywords (especially
"default_missing" and "translit_ignore"). With the current formulation,
each implementor will come up with something very different. What does
"ignore" mean for example? Substitution with the empty string? Is
there any difference between ignoring a character and not providing
a transliteration statement for it? (I can suggest one plausible
transliteration algorithm, but I'd first like to read what you had
in mind originally.)
d) Can included transliteration statements redefine previous ones?
This is one of the many questions about the unspecified transliteration
algorithm that the spec currently does not answer.
e) What are "combining" and "combining_level3" good for? These sets seem
to be only meaningful in one single coded character set, namely UCS,
and there they are hardwired into the respective latest edition of
the ISO 10646 standard. There is no cultural dependency at all here,
so "combining" and "combining_level3" clearly have no place in a cultural
convention specification. They are just fixed properties of a single
standard.
f) wcwidth() and wcswidth() depend on cultural conventions and
transliteration but I haven't seen any provisions for the necessary
tables. These would be much more important than "combining" and
"combining_level3".
g) In section 4.3.2.1, I have great worries about the idea that the
<transliteration_source> string can be more than one character long.
This leads to an endless series of implementation problems and would
definitely be better dropped. For example, the C99 standard requires
all wide-character to multi-byte conversions (that is where in the
C library the transliteration would have to be hooked in) to behave
as if done by calls to wcrtomb(). However, wcrtomb() is required to
swallow a wide character immediately and spit out the corresponding
multi-byte sequence (ISO C99, section 7.24.6.3.3). There is no room for
buffering wide characters until it becomes clear what the longest
<transliteration_source> string is at the current position in the
wide character stream. The mbstate_t value only keeps state in the
sequence of multi-byte characters, not in the sequence of wide characters.
Otherwise, the semantics of the file positioning functions would be
messed up completely. Please please remove the option of transliterating
strings into strings. It sounds neat at first, but clearly wasn't
carefully thought through and obviously is not based on implementation
experience. Single-character to string transliteration, however, is no problem
at all, because it is very similar to wide-character to multibyte-character
conversion, and C99 therefore already has all the necessary infrastructure
in place.
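To illustrate point g), here is a minimal sketch in C of a
single-character to string transliteration step that has the same shape
as wcrtomb(): it takes exactly one wide character, writes its output
immediately, and keeps no state about earlier wide characters. The
function name translit_one(), the table layout and the three example
mappings are assumptions made up for this illustration; they are not
taken from the PDTR or from any existing C library.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical table entry: ONE wide character maps to one string. */
struct translit_entry {
    wchar_t source;
    const char *target;
};

/* Example mappings only; a real locale would supply its own data. */
static const struct translit_entry translit_table[] = {
    { 0x00E4, "ae"  },  /* LATIN SMALL LETTER A WITH DIAERESIS */
    { 0x00DF, "ss"  },  /* LATIN SMALL LETTER SHARP S */
    { 0x20AC, "EUR" },  /* EURO SIGN */
};

/*
 * Like wcrtomb(), consume exactly one wide character and emit its
 * output right away. Returns the number of bytes written (excluding
 * the terminating NUL), or (size_t)-1 if the character has no mapping
 * or the buffer is too small. No lookahead and no buffering of wide
 * characters is needed.
 */
size_t translit_one(char *out, size_t outsize, wchar_t wc)
{
    size_t i;

    /* ASCII passes through unchanged in this sketch. */
    if (wc >= 0 && wc < 0x80) {
        if (outsize < 2)
            return (size_t)-1;
        out[0] = (char)wc;
        out[1] = '\0';
        return 1;
    }

    for (i = 0; i < sizeof translit_table / sizeof translit_table[0]; i++) {
        if (translit_table[i].source == wc) {
            size_t len = strlen(translit_table[i].target);
            if (len + 1 > outsize)
                return (size_t)-1;
            memcpy(out, translit_table[i].target, len + 1);
            return len;
        }
    }
    return (size_t)-1;  /* no transliteration available */
}
```

A <transliteration_source> of two or more characters could not be
handled this way: the function would have to hold back the first wide
character until it has seen the next one, and that is exactly the
buffering for which the wcrtomb() model leaves no room.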
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/