
Re: gcc identifiers



I am not sure this is on-topic here.

Jungshik Shin wrote:
Basically ISO C99
seems to avoid problems arising from multiple representation issues by
allowing only precomposed characters in identifiers

Correct. This is not a C99 (or C++98) decision; it comes from WG20 (the WG in charge of internationalisation), which issued this "recommendation" in TR 10176. The "motivation" is to avoid the problem of normalisation as far as possible.

However, there are some differences between the two standards. In C, we
try (hard) to allow conforming implementations to be "UTFx dumb",
i.e. to have some encoding for Unicode on input, to accept any
character (however bullshit it may be, e.g. \u03A2), and still be conforming.
At the same time we promote better implementations, able to
distinguish between "bullshit" characters and "correct" ones, and to
reject the former. But to do that, the compiler needs
extensive knowledge of the "gray area" between them, and of all the
compatibility problems, such as decomposition, digits, etc.
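
To make the "gray area" concrete, here is a minimal sketch in Python (not anything GCC does; the classify() helper and its category names are made up for illustration). It uses the Unicode general category to tell an unassigned code point such as \u03A2 from a letter, a combining mark, or a non-ASCII digit. The result depends on the Unicode version of the database shipped with the interpreter, which is exactly the kind of knowledge a "smart" compiler would have to carry.

```python
import unicodedata

def classify(ch):
    """Rough, illustrative classification; NOT the C99 Annex D list."""
    cat = unicodedata.category(ch)
    if cat == "Cn":
        return "unassigned"       # no character at this code point
    if cat.startswith("M"):
        return "combining mark"   # raises the decomposition question
    if cat == "Nd":
        return "digit"            # may not start an identifier
    if cat.startswith("L"):
        return "letter"
    return "other"

for ch in ["\u03A2", "\u03A3", "\u0301", "\u0663"]:
    print("U+%04X" % ord(ch), classify(ch))
```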

The result is that the minimum set of programs (the strictly
conforming ones, which must obey all the rules of the standard;
these are intended to be the maximally portable ones, by the way)
should restrict themselves to a set of characters which is intended
to avoid any problem (at least, unless the problem is almost unavoidable,
such as a variable named саѕе, i.e. \u0441\u0430\u0455\u0435 ...)
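
As a small illustration of that last case (a sketch only, using Python's unicodedata module), the Cyrillic string looks like the Latin word "case" but shares no code point with it, and normalisation does not help, since these letters have no decomposition:

```python
import unicodedata

latin = "case"
cyrillic = "\u0441\u0430\u0455\u0435"   # looks like "case" in most fonts

def codepoints(s):
    """Return the code points of s in U+XXXX notation."""
    return ["U+%04X" % ord(c) for c in s]

# Normalisation (here NFKC) leaves both strings unchanged, so they stay distinct.
same_after_nfkc = (unicodedata.normalize("NFKC", latin)
                   == unicodedata.normalize("NFKC", cyrillic))

print(codepoints(latin))
print(codepoints(cyrillic))
print([unicodedata.name(c) for c in cyrillic])
print("equal after NFKC:", same_after_nfkc)
```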


If that's the case, characters like 'Latin Small Letter with Macron'
or 'Hangul Syllable Gga' for which there are alternate representations
should not be present in the list, but they are listed as allowed.

There is no problem with the restricted set, since the alternate representations are not allowed in portable programs. And "good" compilers, which extend the standard, are allowed to treat the alternate as identical to the precomposed version (i.e., they are allowed to use NFKC).
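
A minimal sketch of what such an extension could look like, assuming the compiler folds identifiers through canonical composition (NFC) before comparing them. The same_identifier() helper is hypothetical, and a real implementation might well choose a compatibility form instead. The two examples are the ones from the question: a with macron and the Hangul syllable Gga.

```python
import unicodedata

def same_identifier(a, b):
    """Hypothetical rule: identifiers match if their NFC forms match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed_a = "\u0101"           # LATIN SMALL LETTER A WITH MACRON
decomposed_a = "a\u0304"           # 'a' + COMBINING MACRON
precomposed_gga = "\uAE4C"         # HANGUL SYLLABLE GGA
decomposed_gga = "\u1101\u1161"    # choseong ssangkiyeok + jungseong a

print(precomposed_a == decomposed_a)                    # raw comparison
print(same_identifier(precomposed_a, decomposed_a))     # after NFC
print(same_identifier(precomposed_gga, decomposed_gga))
```

A "dumb" implementation would simply compare the raw code points and see two different identifiers, which is why only the precomposed form is portable.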


  What ISO C99 seems to do is to shift the burden of normalization to
editors or whatever tool used by programmers to edit source files from
compilers and linkers.

You are missing the purpose of a programming language standard. It does not intend to "shift the burden". In fact, given the time we have spent on this (and that implementers are still spending), versus the interest from the "customers", this is perhaps an overworked problem!

But the standard defines at the same time a baseline (the minimum set, maximally portable) and a framework for implementing the actual solutions, in a way that should allow the best interoperation while being as easy as possible to use, and to implement too (think about the C compilers for embedded systems, which are required by law to support the ISO standard because of government requirements, while no one cares about i18n characters... They really need a cheap solution.)

Another kind of implementation is the "validator", i.e. a compiler that only accepts strictly conforming programs, logically to ensure maximum portability. The problem is that the rules are so strict that no useful program (for example, one that uses "open()") can pass...

GCC is definitely something else: it aims at the best support of the standard, so its developers do _not_ want to implement the cheapest solution; they really want something useful, which will go quite a bit further than the minimum implementation, something that fulfills both the letter _and_ the spirit of the standard(s). And they will get it, but it will take some time.


Both ISO/IEC JTC1/SC2/WG2 and UTC would not encode
any more precomposed characters which can be represented with existing
base characters followed by one or more combining characters.

Well... Certainly that is not their intention. But sometimes they get
it wrong. Look at U+17A4 (Khmer QAA). I am sure other examples will come.
For the coming years, perhaps 10 years, Unicode/10646 will be evolving standards, hence moving targets. We have to deal with that.


However,
'combining diacritical marks'(e.g. \u0300 - \u0362) are not allowed in
identifiers  so that 'any character' that's not encoded as a precomposed
form can't be used in identifiers.

Again: they *are* allowed (see 6.4.3: only the \uxxxx for the basic set is explicitly forbidden). But programs that include them are not "strictly conforming", i.e. not maximally portable.
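
For reference, the constraint in 6.4.3 as I read it can be sketched as follows (Python, with a made-up function name; this checks only that one constraint and says nothing about whether a program using the character is strictly conforming): a universal character name may not designate a code point below 00A0 other than 0024 ($), 0040 (@) and 0060 (`), nor one in the surrogate range D800 through DFFF.

```python
def ucn_allowed_by_6_4_3(cp):
    """True if \\uXXXX may designate code point cp under the 6.4.3 constraint."""
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogate range is excluded
    if cp < 0xA0 and cp not in (0x24, 0x40, 0x60):
        return False                      # basic set must be written directly
    return True

for cp in (0x0041, 0x0024, 0x0300, 0xD800):
    print("U+%04X" % cp, ucn_allowed_by_6_4_3(cp))
```

So \u0300 passes this constraint, while \u0041 (the letter A) does not.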


Antoine



-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/