plain text, paragraphs, and bidi
What is the right unit for applying the bidi algorithm to UTF-8
encoded plain text? A line (delimited by a newline character) or a
paragraph (normally delimited by two consecutive newline characters)?
The Unicode technical report describing the Bidi algorithm specifies
the paragraph.
OTOH, the fundamental unit for all Unix text utilities (grep etc.) is
the line.
It makes a difference when a piece of text has right-to-left
embeddings that span a newline. For example,
    english text bla bla ... RLE hebrew text first part <newline>
    hebrew text second part PDF back to english text <newline>
    mixed english and hebrew without explicit direction <newline>
The Unicode Bidi algorithm will treat the [RLE ... PDF] part
specially. If the second line is now removed, through "sed" or "grep",
and the remaining two lines are rendered as one paragraph, the last
line will be rendered differently than before: the PDF went away with
the removed line, so the last line is now under the effect of the RLE
marker.
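
To make the effect concrete, here is a rough Python sketch. It is not
an implementation of the Bidi algorithm; it only counts how many RLE
embeddings are still open at the start of each line, once with all
lines treated as one paragraph and once with every line treated as its
own paragraph (embeddings do not carry past the end of a paragraph).
The function name and the "unit" parameter are made up for this
illustration.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def open_embeddings_at_line_start(lines, unit):
    """For each line, count the RLE embeddings still open when it starts.

    unit="paragraph": all lines form one paragraph, so the embedding
                      state carries over the newlines.
    unit="line":      each line is its own paragraph, so the state is
                      reset at every newline.
    (Hypothetical helper; it only tracks RLE/PDF nesting.)
    """
    depths = []
    depth = 0
    for line in lines:
        if unit == "line":
            depth = 0  # paragraph boundary closes all open embeddings
        depths.append(depth)
        for ch in line:
            if ch == RLE:
                depth += 1
            elif ch == PDF and depth > 0:
                depth -= 1
    return depths


text = [
    "english text " + RLE + "hebrew text first part",
    "hebrew text second part" + PDF + " back to english text",
    "mixed english and hebrew without explicit direction",
]
# What something like `sed 2d` would leave behind.
filtered = [text[0], text[2]]

print(open_embeddings_at_line_start(text, "paragraph"))
print(open_embeddings_at_line_start(filtered, "paragraph"))
print(open_embeddings_at_line_start(filtered, "line"))
```

With the paragraph as the unit, the last line of the filtered text
starts with one embedding still open, whereas in the original text it
started with none. With the line as the unit, it starts with none in
both cases.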
Note that I'm not talking about automatic line wrapping, done in xterm
etc. I'm talking about the case where the newline characters are
already present in the original text.
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/