Re: perl unicode support
Rich Felker wrote:
>
> On Tue, Mar 27, 2007 at 10:07:11PM -0400, Daniel B. wrote:
...
> >
> > What about when it breaks a string into substrings at some delimiter,
> > say, using a regular expression? It has to break the underlying byte
> > string at a character boundary.
>
> Searching for the delimiter already gives you a character boundary.
> There is no need to think further about it.
As long as you specified the delimiter properly (a whole character,
not a partial byte sequence).
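
A minimal sketch of that caveat, in Python (chosen only for illustration; the
sample string and the helper names `split_bytes` and `all_valid_utf8` are
hypothetical, not from this thread). It splits raw UTF-8 bytes once on a
delimiter that is a whole encoded character and once on a single byte taken
from the middle of that character:

```python
def split_bytes(data, delim):
    """Split raw bytes on a raw byte delimiter, with no decoding step."""
    return data.split(delim)


def all_valid_utf8(parts):
    """Return True only if every piece decodes cleanly as UTF-8."""
    for part in parts:
        try:
            part.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


# Hypothetical sample: three fields separated by U+2192 (bytes E2 86 92).
text = "na\u00efve\u2192caf\u00e9\u2192\u65e5\u672c\u8a9e"
data = text.encode("utf-8")

# Delimiter is a whole character: every piece is still valid UTF-8.
whole = split_bytes(data, "\u2192".encode("utf-8"))

# Delimiter is only the middle byte of U+2192: the pieces are cut
# inside a character and no longer decode.
partial = split_bytes(data, b"\x86")

print(all_valid_utf8(whole))    # True
print(all_valid_utf8(partial))  # False
```

The first split works without any encoding awareness because the complete byte
sequence of one character cannot match starting inside another character in
valid UTF-8. The second split has no such guarantee.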
> For example, the unix "cut" program works automatically with UTF-8
> text as long as the delimiter is a single byte,
By "single byte," do you mean a character whose UTF-8 representation
is a single byte? If you gave it the byte 0xBF, would it reject it
as an invalid UTF-8 sequence, or would it possibly cut in the middle
of the byte sequence for a character (e.g., 0xEF 0xBF 0x00)?
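
To show what is at stake in that question, here is a small Python sketch of a
purely byte-oriented field split. It says nothing about what any real "cut"
implementation does with 0xBF; the function `byte_fields` and the sample line
are assumptions for illustration, and the sample uses U+FFFD (0xEF 0xBF 0xBD)
as a character that contains the byte 0xBF.

```python
def byte_fields(line, delim):
    """Split a line into fields on exactly one delimiter byte."""
    if len(delim) != 1:
        raise ValueError("delimiter must be exactly one byte")
    return line.split(delim)


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical input line: a, U+FFFD, b separated by colons.
line = "a:\ufffd:b".encode("utf-8")  # b"a:\xef\xbf\xbd:b"

# An ASCII delimiter (below 0x80) never occurs inside a multibyte
# sequence, so every field is intact.
ascii_fields = byte_fields(line, b":")

# 0xBF is a continuation byte, so a byte-oriented split on it cuts
# through the middle of U+FFFD.
bf_fields = byte_fields(line, b"\xbf")

print([is_valid_utf8(f) for f in ascii_fields])  # [True, True, True]
print([is_valid_utf8(f) for f in bf_fields])     # [False, False]
```

The difference comes from the encoding itself: bytes 0x00 through 0x7F appear
in UTF-8 only as single-byte characters, while 0x80 through 0xBF appear only
as the trailing bytes of longer sequences.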
> > > When I write a basic little perl script that reads in lines from a
> > > file, does trivial string operations on them, then prints them back
> > > out, there should be absolutely no need for my code to make any
> > > special considerations for encoding.
> >
> > It depends how trivial the operations are.
> >
> > (Offhand, the only things I think would be safe are copying and
> > appending.)
>
> This is because you don't understand UTF-8.
Bull. Try providing some real information (a couple of counterexamples).
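
As a starting point for the kind of concrete cases being asked for, here is a
short Python sketch that applies a few operations directly to UTF-8 bytes and
checks whether the result is still valid UTF-8. The sample strings and the
helper `is_valid_utf8` are hypothetical, and the list of operations is not
meant to be exhaustive.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


a = "caf\u00e9".encode("utf-8")       # 4 characters, 5 bytes
b = "\u65e5\u672c".encode("utf-8")    # 2 characters, 6 bytes

# Appending two valid byte strings gives a valid byte string.
appended = a + b

# Searching for a valid substring finds it at a character boundary.
found_at = appended.find(b)

# Truncating at an arbitrary byte offset can cut a character in half.
truncated = a[:4]

# Reversing byte by byte scrambles every multibyte sequence.
reversed_bytes = a[::-1]

print(is_valid_utf8(appended))        # True
print(found_at)                       # 5
print(is_valid_utf8(truncated))       # False
print(is_valid_utf8(reversed_bytes))  # False
print(len(a), len(a.decode("utf-8"))) # 5 4
```

Copying, appending, and searching for a whole substring stay safe at the byte
level here. Anything that counts, indexes, or reorders by byte position needs
to know where the character boundaries are.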
Daniel
--
Daniel Barclay
dsb@xxxxxxxxx
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/