Re: Counting "characters".
On Tue, 2 Apr 2002 17:43:27 -0500
Glenn Maynard <g_lutf8@xxxxxxxx> wrote:
> > mbslen counts the number of characters where a "character" is
> > something I still need to define.
>
> And which definition is useful is very dependent on what you need it
> for. I'd suggest figuring out the different uses you'd expect, and
> defining functions based on that. (Defining a function and then finding
> uses for it is backwards.)
>
> I'm assuming you don't have a specific application in mind, since you
> didn't answer Markus's question.
Ok, here's an example. The Document Object Model W3C spec describes some
'CharacterData' methods:
http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/level-one-core.html#ID-FF21A306
My C implementation of this spec has functions for these methods like:
DOM_String *DOM_CharacterData_substringData(DOM_CharacterData *data, int offset, int count);
void DOM_CharacterData_deleteData(DOM_CharacterData *data, int offset, int count);
These offset and count parameters are described as 'The number of
characters to extract' or 'The character offset at which to insert',
etc. The DOM API is one of these XML peripherals, and so the 'Char'
type is ultimately defined in the XML spec here:
http://www.w3.org/TR/REC-xml#charsets
Which at one point has an actual "definition":
[Definition: A character is an atomic unit of text as specified by
ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]).]
But these XML specs are unavoidably bound to the Java language, so I think
Java's substring, charAt, and indexOf methods have a lot of influence
here.
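To make the ambiguity concrete, here is a rough sketch of the two counts
that seem most likely to matter for these offset and count parameters:
code points (the ISO 10646 reading) and UTF-16 code units (the Java
reading). The function names are hypothetical and not part of my DOM
implementation, and the code assumes the string is stored as well-formed,
NUL-terminated UTF-8; it does no validation.

```c
#include <stddef.h>

/* Hypothetical helpers, assuming well-formed NUL-terminated UTF-8 input. */

/* Count code points: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new code point. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Count UTF-16 code units, the way Java's String.length() would.
 * A lead byte >= 0xF0 starts a 4-byte sequence, i.e. a code point
 * above U+FFFF, which needs a surrogate pair (2 units). */
static size_t utf8_count_utf16_units(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if ((c & 0xC0) == 0x80)
            continue;               /* continuation byte, already counted */
        n += (c >= 0xF0) ? 2 : 1;
    }
    return n;
}

/* Map a code point offset to a byte offset, as a substringData-style
 * function would need. Returns the byte length of the string if the
 * offset is past the end. */
static size_t utf8_byte_offset(const char *s, size_t offset)
{
    const char *p = s;
    while (*p && offset > 0) {
        p++;                        /* step over the lead byte */
        while (((unsigned char)*p & 0xC0) == 0x80)
            p++;                    /* step over continuation bytes */
        offset--;
    }
    return (size_t)(p - s);
}
```

For a string made of 'a', U+00E9, U+1D11E and 'z' (8 bytes of UTF-8), the
first function gives 4 and the second gives 5, so the two readings of
"character" already disagree on which offset refers to the 'z'.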
I guess I should look up 'atomic unit of text' in the ISO-10646 doc. That
sounds interesting.
Thanks,
Mike
--
May The Source be with you.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/