Re: perl unicode support
On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <dsb@xxxxxxxxx>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression? It has to break the underlying byte
> >string at a character boundary.
>
>
> Unless you pass invalid utf-8
> sequences to your regular
Haha, was it your intent to use this huge Japanese wide ASCII? :)
Sadly I don't think Daniel can read anything but Latin-1...
Here's an ASCII transliteration...
~Rich
On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <dsb@xxxxxxxxx>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression? It has to break the underlying byte
> >string at a character boundary.
>
> Unless you pass invalid utf-8 sequences to your regular expression
> library, that should be impossible. Breaking strings works great as
> long as you pattern match for boundaries.
>
> The only time it fails is if you break it at arbitrary byte
> indexes. Note that breaking utf-32 strings at arbitrary indices also
> destroys the text.
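
A minimal Python sketch of that difference, assuming the input is valid
UTF-8. The sample strings and the helper is_valid_utf8 are made up for
illustration. The last part is one reading of the utf-32 remark: slicing
by code point can still separate a base letter from its combining mark.

```python
import re

def is_valid_utf8(b: bytes) -> bool:
    """Return True if b decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

data = "日本語,テキスト,abc".encode("utf-8")

# Splitting where a delimiter pattern matches: every piece stays valid,
# because the byte 0x2C (",") never occurs inside a multi-byte sequence.
pieces = re.split(rb",", data)
decoded = [p.decode("utf-8") for p in pieces]

# Cutting at an arbitrary byte index can land inside a character.
cut_inside = data[:1]      # first byte of the 3-byte sequence for "日"
cut_on_boundary = data[:3] # the whole sequence for "日"

# Python str indexes by code point, much like indexing a utf-32 buffer.
# "e" + U+0301 (combining acute) renders as one accented letter.
accented = "e\u0301"
lost_accent = accented[:1]  # only the bare "e" is left
```

Here is_valid_utf8(cut_inside) is False while every element of pieces
decodes without error.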
>
> >In fact, what about interpreting an underlying string of bytes as
> >the right individual characters in that regular expression?
>
> The regular expression engine should be utf-8 aware. The code that
> uses and calls it has no need to be.
>
> >Any time a program uses the underlying byte string as a character
> >string other than simply a whole string (e.g., breaking it apart,
> >interpreting it), it needs to consider it at the character level,
> >not the byte level.
>
> Only the most fancy interpretations require any knowledge of unicode
> code points. Any substring match on valid sequences will produce valid
> boundaries in utf-8, and that's the whole point.
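
The substring claim can be checked with a plain byte search, which knows
nothing about code points. This is a small Python sketch assuming both
the haystack and the needle are valid UTF-8; the strings and variable
names are invented for the example.

```python
haystack = "naïve café 日本語".encode("utf-8")
needle = "é".encode("utf-8")   # two bytes: 0xC3 0xA9

# bytes.find compares raw bytes only.
idx = haystack.find(needle)

# A continuation byte has the bit pattern 10xxxxxx. A match of a valid
# needle cannot start on one, since the needle starts with a lead byte.
starts_on_boundary = (haystack[idx] & 0xC0) != 0x80

# Both sides of the match are therefore still valid UTF-8.
head = haystack[:idx].decode("utf-8")
tail = haystack[idx + len(needle):].decode("utf-8")
```

Neither decode call raises, because lead bytes and continuation bytes
occupy disjoint ranges in UTF-8, so a valid needle can only match where
a character starts.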
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/