Re: Announcing Bytext
Ack. This text isn't wrapped. Quotes-at-the-bottom were bad enough.
On Sun, Feb 03, 2002 at 05:57:28AM -0800, Bernard Miller wrote:
> I don’t mean to imply that David Starner is an idiot, there are other reasons why someone might not understand Bytext, from the trivial (no time, no interest) to the less trivial (learning styles, documentation errors, etc.). I’ve changed some wording based on his specific concerns and I would appreciate other specific comments. The issue with the statements he quotes is basically that Bytext strips various characters like combining characters of all their properties except for a name and a property that maps them to a Unicode character. So combining characters sort of exist (for compatibility) and sort of do not (as fully defined characters). Bytext can be thought of as an exercise in massive precomposition, an attempt to eliminate the need for combining characters and formatting characters and grapheme clusters. Precomposition is the spirit of the W3C character model, Bytext simply takes this to its logical conclusion. It simplifies many text processes, especially for syllable oriented scripts like Devanagari. It may seem to involve too many characters, but it is finite and thus considerably less than the infinite number of abstract characters in Unicode. Also, there is a logic to the way the characters are formed with bytes that makes it easy to process algorithmically, it’s not just a huge list of characters.
If this format isn't two-way compatible with Unicode (as well as all of
the major character sets Unicode is two-way compatible with), it's got
another compatibility strike against it.
> About people having an emotional attachment to Unicode, I’m not necessarily referring to people on this thread. Perhaps David has emotional issues with bad typography, maybe he was abused as a child by poor documentation ;-) Nah, but what else other than emotion can explain it when minor spelling errors are characterized as “inconsistencies” (never mind that Bytext errata has no place in the Unicode mailing list); or the various hostile comments only minutes after it was announced; or the knee-jerk ridicule of new characters I proposed which later received serious consideration by other members; or the many people who took offense at the mere implication that they should find it interesting? Character encoding as a science is kind of like arithmetic, one doesn’t expect a lot of major new developments --but things like lambda calculus still come along many years later. If someone implementing an arithmetic library doesn’t even find lambda calculus interesting and refuses to even read about it, I would say something is missing from that person, perhaps they are prime candidates for being replaced by a robot. The same goes with those that are implementing Unicode...
I saw no knee-jerk responses on this list, and this is the one I
currently read.
> As for ASCII transparency (a more appropriate word than compatibility) and the general notion of how complex Bytext is compared to Unicode, there are 2 important concepts to take note of: The first is that making things easier for the user will USUALLY involve making things more difficult for the developer. You can’t expect a user to shed a tear for a developer, the user simply wants the best thing possible. Surely no one is suggesting that Bytext is IMPOSSIBLE. It is a headache to implement any new feature, but it is also an opportunity for growth.
Lack of ASCII transparency means lack of an upgrade path, which means
impracticality.
> I propose that fast and intuitive regular expressions are a feature that will not lose importance because no matter how fast computers get, the amount of data that needs to be searched can easily grow even faster. The first step of any search, even a database search, is a regex. Not only do regexes need to be fast, they also need to be intuitive because nowadays regexes are composed by ordinary people. Open composition searching (a feature of Bytext) is incredibly intuitive: you can search for components of characters the same way you can search for components of words. It all but eliminates the need for case folding or what might be called “diacritic folding”. It also puts native Unix technologies (8 bit regexes) back in the forefront. If you want another compelling reason for Bytext, read the section on OBS characters.
Impracticality will kill any format, regardless of what it provides.
> The other thing to take note of is the notion of absolute complexity vs relative complexity. Because of the lack of ASCII transparency, Bytext may be arguably more difficult to implement on a trivial level than UTF-8 on ASCII based systems (it may have more relative complexity). But consider that many peoples of the world actually want to use their native scripts in protocols and functions. To say that being able to automatically ignore non ASCII codes is why UTF-8 is better is an affront.
An affront? It's a purely practical matter. Unixes in general are 7-bit
ASCII by default. Nobody is suggesting that multibyte-aware programs
ignore non-ASCII codes. We're saying that you don't have to convert
*all of your programs at once* to use UTF-8 at all, which you would have
to do for any ASCII-incompatible character set. Doing that would be
completely impossible, and, in the real world, simply won't happen.
> Not doing a proper conversion of charsets is clearly a hack, like programming without type safety --not always a bad thing but certainly shouldn’t be imposed on everyone.
Incorrect. Since ASCII is a subset of UTF-8, no conversion is necessary, so
any such conversion would simply do nothing. (The same is true of all
character sets which are a superset of ASCII--which is most of them, for
the same practical reasons.) This is by design, of course; ASCII is the
common denominator, which makes it possible to transition to more useful
character sets.
If a text file is correct ASCII, and the user's locale is UTF-8, the text
file is correct UTF-8; all of the available codepoints are well-defined and
nothing is assumed. This isn't a hack, this is by the design of UTF-8 and
Unicode.
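To make the "conversion would simply do nothing" point concrete, here is a
minimal sketch in Python. The sample string is made up for illustration; any
pure 7-bit ASCII text should behave the same way.

```python
# Hypothetical sample text; any pure 7-bit ASCII string behaves the same way.
sample = "Plain 7-bit ASCII text.\n"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")

# The two encodings produce byte-for-byte identical output,
# so "converting" ASCII to UTF-8 changes nothing.
same_bytes = ascii_bytes == utf8_bytes

# Decoding the ASCII bytes as UTF-8 gives back the original text.
round_trip = ascii_bytes.decode("utf-8") == sample

# Every byte is below 0x80, so each one is a single-byte UTF-8 sequence.
all_single_byte = all(b < 0x80 for b in ascii_bytes)

print(same_bytes, round_trip, all_single_byte)
```

All three checks should come out true, which is why an ASCII-only program
keeps working unchanged on such a file under a UTF-8 locale.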
> Many of the elegant features of Unixes depend on the notion of 8 bit transparency: pipe, cat, echo... the byte stream is the common denominator. The functions are general purpose and thus more useful. Bytext takes this elegant notion to its logical conclusion: not only can you process text as bytes, you can also process bytes as text. By default, everything is preserved and there are no special sequences to worry about. You can open ANY file as a text file and scan it for troubleshooting information or just as a way of trying to visually deduce what kind of file it is. It is useful to apply regular expressions and various functions like “diff” to arbitrary binary data not only using the same familiar functions, but also within the same familiar application --your text editor.
In practice, it's no big deal to open binary files with a decent text editor.
Vim handles it just fine. (The major issue is handling NULs, and that
won't be helped by any encoding.)
If I want to grep a binary file, I use "strings file | grep" (or "nm" or
whatever), and I only grep the stuff that's useful to grep. Same for diff.
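For anyone who hasn't used that pipeline, here is a rough sketch of the idea
in Python. It only approximates strings(1): the function names, the
four-character minimum, and the sample bytes are my own assumptions for
illustration, not anything taken from a real tool's source.

```python
import re

def printable_runs(data: bytes, min_len: int = 4):
    """Yield runs of printable ASCII of at least min_len bytes (roughly strings(1))."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    for match in re.finditer(pattern, data):
        yield match.group().decode("ascii")

def grep_strings(data: bytes, regex: str, min_len: int = 4):
    """Return the printable runs that match regex (roughly `strings | grep`)."""
    return [s for s in printable_runs(data, min_len) if re.search(regex, s)]

# Made-up binary blob: NULs and high bytes mixed with a few readable strings.
blob = b"\x00\x01ELF\x02\x00libc.so.6\x00\xff\xfeerror: bad header\x00ab\x00"

runs = list(printable_runs(blob))
hits = grep_strings(blob, r"error")
print(runs)
print(hits)
```

Only the readable runs are searched, so the NULs and other binary noise never
reach the regex at all.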
(David:)
> How about an example? Say, "ᎰᎵ hat Musik gut gehört." What does that
> look like bytewise in Bytext?
A distinct advantage of replying in the common style is that it keeps
each response next to the point it answers; I don't think you answered
this question, at least on this list.
--
Glenn Maynard
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/