
RE: Correct use of UTF-8 under Unix




Hi!

        Larry is right that there are (already, also under Unix)
other ways of separating lines: namely form feed, but also vertical
tab. I must admit that I have never used vertical tab, and very
rarely form feed... Anyway C9x says: "\v (vertical tab) Moves the
active position to the initial position of the next vertical tab
position." And there is a similar statement about form feed. I
assume that is not too far off from what other standards might say. 

        So, the interoperable line (or 'stronger') separators in
"plain text" are:

        \X{2028}|\X{2029}|\r\n|\n|\r|\f|\v|\X{85}

(I'm probably mixing Perl and C (and flex) syntax here.) Some
of them are "stronger" in some senses than line separation,
but for the purposes of counting logical lines, and deciding
logical line begin and logical line end, there should be no
difference.  A single logical line may be *dynamically* wrapped
into several displayed lines, but that is a different matter.
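
For anyone who wants to try the alternation above, here is a minimal
sketch in Python rather than the mixed Perl/C/flex notation. The names
LINE_SEP, count_separators and split_logical_lines are made up for this
illustration and do not come from any existing tool.

```python
import re

# The separator alternation from the message, written as Python string
# escapes. "\r\n" is listed before "\n" and "\r" so that a CR LF pair
# is consumed as one separator rather than two.
LINE_SEP = re.compile("\u2028|\u2029|\r\n|\n|\r|\f|\v|\x85")


def count_separators(text):
    """Count line (or 'stronger') separators in text."""
    return len(LINE_SEP.findall(text))


def split_logical_lines(text):
    """Split text into logical lines at every separator."""
    return LINE_SEP.split(text)
```

A tool that counts only "\n" and a tool that counts the full set will
disagree on any text that mixes the styles, which is the line-numbering
problem Larry describes below. Python's built-in str.splitlines() splits
on a similar set, but it also treats \x1c, \x1d and \x1e as boundaries.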

        Note that there are some "legacy" encodings which do not
have any or all of \f|\v|\X{85}.
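
As a rough way to check this for a given encoding, the sketch below
asks Python's codecs whether each of the three characters can be
encoded. The helper name can_encode and the choice of encodings are
mine, picked only for illustration; they are not a list of the
encodings meant above.

```python
def can_encode(ch, encoding):
    """Return True if the character ch has a representation in encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Form feed, vertical tab and NEL (U+0085), checked per encoding.
for enc in ("ascii", "cp1252", "latin-1"):
    print(enc, [can_encode(c, enc) for c in ("\f", "\v", "\x85")])
```

In Python's codecs, ASCII and Windows-1252 have no way to encode
U+0085 (in Windows-1252 the byte 0x85 is the ellipsis character
instead), while Latin-1 carries it as a C1 control.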

        (I still think that having two different kinds of "plain
text" is a bad idea.  I haven't heard anyone else entertain it
either.)

                Kind regards
                /Kent K


Larry Wall wrote:
...
> The only problem I see offhand with allowing both styles in the same
> file is that different tools might count lines differently.  If Perl
> says there's a syntax error at line 582, it might mean it has seen 581
> instances of /\012 | \015\012 | \015 | \X{2028} | \X{2029}/x before the
> error.  (For folks listening in, that works out to Unix newline, Windows
> newline, Mac newline (!), Unicode line separator and Unicode paragraph
> separator.)  If your "normal plain text" editor then counts only \012
> (Unix newline), the programmer isn't going to be able to find the error.
>
> On the other hand, maybe Perl would just count newlines, and your
> editor counts it the other way.  More likely, some editors count one
> way, and other editors count another.  Maybe they count LS but not PS,
> just as Perl currently counts \n but not \f as a line transition.
> There are many possibilities.
>
> All I'm really arguing here is that it would be good to establish a
> line counting convention.  But if that convention pretends there won't
> be files mixing the two line delimitation styles, that will have other
> ramifications, including possibly an adverse impact on portability.
> Counting line numbers right is already pretty complicated when you have
> NFS mounts from foreign systems.  Adding in Unicode will only make
> things more complicated.  There will be some pressure to use Unicode
> LS/PS in portable code, and I'm not sure you want to spend the rest of
> your life resisting that pressure.  A lot of the "fixes" in Perl are
> only there because we got tired of people asking the same questions
> over and over.
>
> I think assuming that files will only be one style or the other will
> put us into that sort of a situation, and it would be nice to head it
> off early, for some definition of early.  Just telling people by fiat
> that they can't mix the two styles is not likely to work in the absence
> of universal education.  Unfortunately, the education of the illegitimi
> tends to result in carborundum.
>
> Larry
> -
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/lists/
>