UTF-8 line feeds versus LS/PS
Frank da Cruz wrote on 1999-09-17 18:55 UTC:
> > Example: Imagine, you have a shell script that tests whether a submitted
> > UTF-8 message is not longer than 20 lines. With correct UTF-8 usage, "wc
> > -l" does not have to be modified to process UTF-8 files, because all it
> > does is counting LF characters (bytes) in the file.
> >
> Unless somebody is using LS or PS rather than LF in their Unicode files :-)
>
> This also presupposes (as UNIX itself does, in general, which I believe
> to be a Good Thing) that "plain text" is "preformatted" -- as distinct from
> the Microsoft idea of plain text, in which a "line" is really a "paragraph",
> and assumes that all "plain text" is fed through some sort of "rendering
> engine" for viewing by humans.
Due to the ASCII compatibility requirement, UTF-8 plain-text files under
POSIX systems will remain LF-terminated sequences of lines, exactly as
it was with ASCII, ISO 8859, etc. No LS/PS ever.
The Unicode LINE SEPARATOR and PARAGRAPH SEPARATOR control codes make
sense inside word processor file formats, database fields, etc., but
they are not a replacement for the good old Unix \n, which is hardcoded
into basically every Unix program on this planet. They are just not
ASCII compatible. There is nothing wrong with using LS/PS in UCS-2/
UTF-16 environments where ASCII compatibility does not matter, but they
really do not go well with environments for which UTF-8 was designed.
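
To illustrate the point about "wc -l", here is a minimal Python sketch. It assumes that counting lines means counting LF bytes (0x0A), as described above. The helper name count_lf_bytes and the sample strings are made up for this illustration.

```python
# Sketch: count lines the way "wc -l" is described above, i.e. by
# counting LF bytes (0x0A). count_lf_bytes is a hypothetical helper.
def count_lf_bytes(data: bytes) -> int:
    return data.count(b"\n")

# Three lines, LF-terminated, with non-ASCII characters in between.
lf_bytes = "Grüße\nκόσμε\n日本語\n".encode("utf-8")

# The same three pieces of text separated by LS (U+2028) and PS (U+2029).
ls_bytes = "Grüße\u2028κόσμε\u2029日本語".encode("utf-8")

# UTF-8 encodes every non-ASCII character using only bytes >= 0x80,
# so multi-byte sequences never contain a stray 0x0A byte.
ls_hex = "\u2028".encode("utf-8").hex()   # e280a8
ps_hex = "\u2029".encode("utf-8").hex()   # e280a9

print(count_lf_bytes(lf_bytes))   # 3
print(count_lf_bytes(ls_bytes))   # 0
print(ls_hex, ps_hex)
```

A byte-oriented LF counter therefore keeps working on UTF-8 input, but it sees an LS/PS-separated file as a single unterminated line.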
Side remark:
It would indeed be nice to also introduce under Unix a text format
where paragraphs are formatted at display time (as Word does) and
where soft line breaks inside paragraphs are not saved to the file. The
main advantage is that diffs become significantly more compact
(assuming they operate on byte ranges, not on lines). At the moment,
changing a few words and then reformatting a paragraph moves all the
LF bytes around, and the revision control system has to keep track of
every one of them, which is not very elegant.
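
The effect on diffs can be sketched roughly in Python. This is only an approximation: it compares word lists instead of byte ranges, and the sample paragraph, the width of 40, and the helper changed_items are all invented for the illustration.

```python
import difflib
import textwrap

def changed_items(old, new):
    """Count items in old and new that lie outside every matching block."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return (len(old) - matched) + (len(new) - matched)

old_par = ("The quick brown fox jumps over the lazy dog and then runs "
           "across the field until it reaches the river where it stops "
           "to drink before heading back home in the evening light.")
# Change a few words near the start of the paragraph.
new_par = old_par.replace("quick brown fox",
                          "remarkably quick and clever brown fox")

# Stored with hard line breaks: both versions are re-wrapped, so the
# LF positions shift through the rest of the paragraph.
old_lines = textwrap.wrap(old_par, width=40)
new_lines = textwrap.wrap(new_par, width=40)
line_changes = changed_items(old_lines, new_lines)

# Stored as one LF-free paragraph and compared word by word.
word_changes = changed_items(old_par.split(), new_par.split())

print("changed lines:", line_changes)
print("changed words:", word_changes)
```

With this sample, inserting three words changes every line of the wrapped paragraph, whereas the word-level comparison reports only the three inserted words.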
It would also be very helpful if emacs, vim, less, etc. had a mode
similar to Windows Notepad and Word, where paragraphs are
essentially long lines without any LF in them. LF-free paragraphs would
be especially convenient for editing plain-text files that will later be
reformatted anyway and where line length does not matter at all, e.g.
HTML and TeX.
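
A display-time formatter of this kind could look roughly like the following Python sketch. It assumes one paragraph per LF-terminated line in the stored text. The function name render and the width are arbitrary choices for the example, not an existing tool.

```python
import textwrap

def render(stored: str, width: int = 72) -> str:
    """Wrap each LF-terminated paragraph for display only.

    The stored text is not modified; soft line breaks exist only in
    the returned string.
    """
    paragraphs = [p for p in stored.split("\n") if p]
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)

stored = ("This first paragraph is kept in the file as one long line "
          "without any soft line breaks in it.\n"
          "The second paragraph is stored the same way, terminated "
          "by a single LF.\n")

shown = render(stored, width=30)
print(shown)
```

The file on disk keeps exactly one LF per paragraph, so "wc -l" on it counts paragraphs, and the wrapping width is purely a property of the viewer.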
However, all this is again *completely* independent and orthogonal to
Unicode. Unformatted plain-text files would also be nice with just
ASCII, and LF is as good a paragraph separator as Unicode's PS. I'd
rather not use LS and PS at all on POSIX systems, because it would break
a tremendous amount of software, even though I do appreciate that the
clearly-defined LS/PS semantics does have its attractions and is much
nicer in UCS-2 files than the historic CR/LF/NL mess.
http://www.unicode.org/unicode/reports/tr13
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/