Re: perl unicode support
> Once you have a regex library that handles codepoints, the code that uses
> it doesn't have to care about them in particular.
It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
UTF-8).
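To make that byte sequence concrete, here is a minimal Perl sketch (the
variable names are mine, not from the thread). It assumes the core Encode
module and compares how the same pattern behaves on the raw bytes and on
the decoded characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes from the example: "A", the two-byte UTF-8 form of "Á", "B"
my $bytes = pack("C*", 65, 195, 129, 66);

# Decoding turns the two bytes 195 129 into the single character U+00C1
my $chars = decode("UTF-8", $bytes);

my $byte_len = length($bytes);   # 4: counted as bytes
my $char_len = length($chars);   # 3: counted as characters

# "." consumes one element of whatever string it is handed
my $byte_match = $bytes =~ /^A.B$/ ? 1 : 0;   # 0: two bytes sit between A and B
my $char_match = $chars =~ /^A.B$/ ? 1 : 0;   # 1: one character sits between A and B

print "bytes: length=$byte_len match=$byte_match\n";
print "chars: length=$char_len match=$char_match\n";
```

The pattern is identical in both cases; only the string's interpretation
differs.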
Why is it not so simple? I just want to know some basic information:
does it match or not, and what range of bytes in the string was matched?
I don't care what the regex library does under the covers, and I
shouldn't have to care...
I can safely extract substrings on those boundaries now if it did its job right.
If it knows how to match "Á" to ".", then I don't have to know how it
goes about doing so.
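A rough sketch of what "did it match, and which bytes" could look like
in Perl today, assuming the core Encode module. The offsets in @- and @+
are character offsets on a decoded string, so this sketch converts them
to byte offsets by re-encoding the prefix; that conversion step is my
own illustration, not something the regex engine provides:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = pack("C*", 65, 195, 129, 66);   # A, Á (UTF-8), B
my $chars = decode("UTF-8", $bytes);

my ($byte_start, $byte_end);
if ($chars =~ /A(.)B/) {
    # Copy the character offsets of group 1 before calling anything else
    my ($cs, $ce) = ($-[1], $+[1]);

    # Byte offset = encoded length of everything before that character offset
    $byte_start = length(encode("UTF-8", substr($chars, 0, $cs)));
    $byte_end   = length(encode("UTF-8", substr($chars, 0, $ce)));
}

# Extract the matched range from the original byte string
my $matched = substr($bytes, $byte_start, $byte_end - $byte_start);

printf "matched bytes %d..%d: %s\n",
    $byte_start, $byte_end, join(" ", unpack("C*", $matched));
# prints "matched bytes 1..3: 195 129"
```

The substring taken on those boundaries is exactly the two bytes of "Á",
so nothing gets cut in the middle of a character.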
Even better if the regex engine handles both normalization forms
transparently. My code should never have to care. I shouldn't have to
jump through hoops, call all sorts of fancy "binmode" settings, or
perform "Encode::decode" incantations everywhere to turn my scalars
back into plain old strings.
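For reference, a small sketch of the normalization issue, assuming the
core Unicode::Normalize module. As far as I know the regex engine does
not normalize its input, so the caller has to pick a form or match with
\X (one extended grapheme cluster):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $nfc = "\x{C1}";     # precomposed "Á" (one code point)
my $nfd = NFD($nfc);    # decomposed: "A" followed by U+0301 COMBINING ACUTE ACCENT

my $nfc_dot = $nfc =~ /^.$/  ? 1 : 0;   # 1: "." covers the single code point
my $nfd_dot = $nfd =~ /^.$/  ? 1 : 0;   # 0: two code points, "." covers only one
my $nfd_x   = $nfd =~ /^\X$/ ? 1 : 0;   # 1: \X covers base letter plus accent

# Normalizing both sides to the same form makes them compare equal
my $same = NFC($nfd) eq $nfc ? 1 : 0;   # 1

print "nfc_dot=$nfc_dot nfd_dot=$nfd_dot nfd_x=$nfd_x same=$same\n";
```

So the same visible "Á" matches "." in one form and not in the other,
which is the kind of detail I would rather the engine hid from me.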