
Re: perl unicode support



> > Once you have a regex library that handles codepoints, the code that uses
> > it doesn't have to care about them in particular.
>
> It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
> 66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
> UTF-8).

Why is it not so simple? I just want to know some basic information:
does it match or not, and what range of bytes in the string was matched?

I don't care what the regex library does under the covers, and I
shouldn't have to care...
If it did its job right, I can safely extract substrings on those
boundaries.
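
To make that concrete, here is a rough sketch using the byte sequence
quoted above. It only illustrates the two questions (did it match, and
which bytes matched); the variable names are mine, and it assumes the
input really is valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The byte sequence from the thread: 65 195 129 66 ("AÁB" in UTF-8).
my $bytes = pack('C*', 65, 195, 129, 66);

# Matching on raw bytes: "." consumes a single byte, so "A.B" cannot
# match, because two bytes sit between "A" and "B".
my $byte_match = $bytes =~ /^A.B$/ ? 1 : 0;

# Matching on decoded characters: "." consumes the whole "Á".
my $chars = decode('UTF-8', $bytes);
my $char_match = $chars =~ /^A(.)B$/ ? 1 : 0;

# @- and @+ give the character offsets of group 1 in the decoded string.
my ($char_start, $char_end) = ($-[1], $+[1]);

# Translate the character offsets back into byte offsets in the original
# data by re-encoding the prefix that precedes each boundary.
my $byte_start = length(encode('UTF-8', substr($chars, 0, $char_start)));
my $byte_end   = length(encode('UTF-8', substr($chars, 0, $char_end)));

print "byte match: $byte_match, char match: $char_match\n";
print "chars $char_start..$char_end, bytes $byte_start..$byte_end\n";
```

This should print "byte match: 0, char match: 1" and then
"chars 1..2, bytes 1..3". The decode call and the offset translation are
exactly the bookkeeping I would like the engine to do for me.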

If it knows how to match "Á" to ".", then I don't have to know how it
goes about doing so.
Even better if the regex engine handles both normalization forms
transparently. My code should never have to care. I shouldn't have to
jump through hoops, and call all sorts of fancy "binmode" settings or
perform "Encode::decode" incantations everywhere to turn my scalars
back into plain old strings.
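
As a small illustration of the normalization point, the sketch below
compares the precomposed and decomposed spellings of "Á". It assumes the
core Unicode::Normalize module is available; the variable names are only
for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "\x{C1}";     # "Á" as a single code point (composed form)
my $decomposed = "A\x{301}";   # "A" followed by COMBINING ACUTE ACCENT

# The two spellings are different code point sequences until normalized.
my $same_raw = $composed eq $decomposed ? 1 : 0;
my $same_nfc = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# "." matches one code point, so it cannot cover the decomposed form...
my $dot_on_decomposed = $decomposed =~ /^.$/ ? 1 : 0;

# ...while \X matches a whole grapheme cluster (base plus combining mark).
my $cluster_on_decomposed = $decomposed =~ /^\X$/ ? 1 : 0;

# After normalizing to NFC, "." matches again.
my $dot_after_nfc = NFC($decomposed) =~ /^.$/ ? 1 : 0;

print "raw equal: $same_raw, NFC equal: $same_nfc\n";
print "dot: $dot_on_decomposed, \\X: $cluster_on_decomposed, ",
      "dot after NFC: $dot_after_nfc\n";
```

So today the caller has to pick between normalizing first or writing \X
instead of ".", which is the kind of detail I would rather not have to
carry around in my own code.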