Proposal for 2 Byte Unicode implementation in gcc and glibc
Hi everybody
in the following we present a proposal for 2 byte Unicode support in
gcc and libraries.
Motivation for our proposal:
When comparing applications using 4 byte Unicode running on Linux with
similar applications using 2 byte Unicode on other platforms
(e.g. Win...), Linux will always show the worse performance.
Even an otherwise superior OS performance cannot compensate for the
additional requirements in memory bandwidth, CPU, disk space etc. One
simple example: for a typical database of about 100 GB, as used in
medium sized companies, we find a ratio of about 70 percent strings to
30 percent other data. Assuming the strings currently take 1 byte per
character, the transition to 2 byte Unicode would increase the disk
space to (2*70 + 30) % = 170 %. If we change to 4 byte Unicode, the same
database would grow to (4*70 + 30) % = 310 %.
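The arithmetic behind these figures can be sketched in a few lines of C.
The function name relative_size_percent is ours and purely illustrative;
the 70/30 split is the estimate quoted above, and the baseline assumes
1 byte per character.

```c
/* Relative database size, in percent of a baseline that stores
 * 1 byte per character (an assumption of this sketch).
 *   string_pct     share of the database that is character data (0..100)
 *   bytes_per_char width of one character in the new representation */
unsigned relative_size_percent(unsigned string_pct, unsigned bytes_per_char)
{
    unsigned other_pct = 100u - string_pct;  /* non-string data is unchanged */
    return string_pct * bytes_per_char + other_pct;
}
```

With the numbers above, relative_size_percent(70, 2) gives 170 and
relative_size_percent(70, 4) gives 310.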
If we want Linux to become a major and globally usable platform we
strongly believe that we cannot sustain this inflation, i.e. we have to
provide an additional - 2 byte - implementation of Unicode. At least
the programmer must be free to choose which way they want their programs
to work. Otherwise Linux will be on the wrong track.
Please see the following text for some detailed information and
the attachment for our full proposal:
****************************************************************************
The next version of our business applications will be offered with
a Unicode option. The software is programmed with 2 byte Unicode
characters. Because Unicode characters are 4 bytes wide in the Linux
world today, there has been some effort to add 2 byte Unicode strings to
gcc 2.95.2 and to implement the appropriate string handling routines. We
would appreciate it if gcc and glibc were enhanced with our feature
proposal. Of course, we are willing to contribute towards the
implementation of 2 byte Unicode. We already have a patch for gcc to
provide some 2 byte Unicode support.
Reasons for Using UTF-16 and UTF-32
What is UTF-16?
UTF-16 allows access to 63K characters as single Unicode 16-bit
units. It can access an additional 1M characters by a mechanism known as
surrogate pairs. Two ranges of Unicode code values are reserved for the
high (first) and low (second) values of these pairs. Highs are from
0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0, there
are no assigned surrogate pairs. Since the most common characters have
already been encoded in the first 64K values, the characters requiring
surrogate pairs will be relatively rare. (Taken from Unicode FAQ
Copyright ©1998-2000 Unicode, Inc.)
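The ranges quoted above can be checked with a small C sketch. The helper
names are ours (hypothetical, not part of the proposal). The counts follow
directly from the two ranges: 1024 high values times 1024 low values give
the additional 1M characters, and 65536 - 2048 values remain for
single-unit characters, which is roughly the 63K mentioned above.

```c
#include <stdint.h>

/* Surrogate ranges as quoted from the Unicode FAQ above. */
#define HIGH_SURROGATE_FIRST 0xD800u
#define HIGH_SURROGATE_LAST  0xDBFFu
#define LOW_SURROGATE_FIRST  0xDC00u
#define LOW_SURROGATE_LAST   0xDFFFu

/* Nonzero if u is the first (high) unit of a surrogate pair. */
int is_high_surrogate(uint16_t u)
{
    return u >= HIGH_SURROGATE_FIRST && u <= HIGH_SURROGATE_LAST;
}

/* Nonzero if u is the second (low) unit of a surrogate pair. */
int is_low_surrogate(uint16_t u)
{
    return u >= LOW_SURROGATE_FIRST && u <= LOW_SURROGATE_LAST;
}

/* Number of characters reachable through surrogate pairs. */
uint32_t surrogate_pair_count(void)
{
    uint32_t highs = HIGH_SURROGATE_LAST - HIGH_SURROGATE_FIRST + 1u;
    uint32_t lows  = LOW_SURROGATE_LAST - LOW_SURROGATE_FIRST + 1u;
    return highs * lows;
}

/* Number of 16-bit values that are not surrogates. */
uint32_t single_unit_count(void)
{
    return 0x10000u - (LOW_SURROGATE_LAST - HIGH_SURROGATE_FIRST + 1u);
}
```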
What is UTF-32?
All characters represented in UTF-16, both those represented with 16
bits and those with a surrogate pair, can be represented as a single
32-bit unit in UTF-32. This single unit corresponds to the Unicode
scalar value, which is the abstract number associated with a Unicode
character. UTF-32 is a subset of the encoding mechanism called UCS-4 in
ISO 10646. (Taken from Unicode FAQ Copyright ©1998-2000 Unicode, Inc.)
These are reasons to use UTF-16:
1.Performance
The UTF-16 representation of textual data needs only half the
amount of memory that a 32-bit representation would need, provided
that surrogate pairs occur only seldom, which will be the
case. Memory itself may be cheap, but the size of the data that
has to be handled for each user in a multiuser environment is a
critical performance issue.
2.Portability
Software that uses wchar_t has restricted portability since
wchar_t sometimes has 32 bits, but sometimes only 16 bits. A
dedicated type for Unicode with platform-independent length makes
it possible to write portable software.
3.Interplatform Communication
When UTF-16 is used for communication between different platforms,
merely an Endian conversion may be necessary, which can be done in
place. A conversion between UTF-16 and UTF-32 is more costly.
UTF-32 would imply unacceptably high data volumes when used for
communication.
4.Embedding in existing IT infrastructures
UTF-16 Unicode implementations integrate better into existing IT
landscapes. Here we find products which use 16-bit Unicode,
too. One example for this is the Java Native Interface (JNI).
Using JNI, it is possible to access the UTF-16
representation from C functions. UTF-16 support in C is therefore
desirable. (Furthermore, in JDK 1.2, classes such as
java.io.OutputStreamWriter and java.io.InputStreamReader support
the conversion of surrogate pairs to single UTF-8 characters and
back.)
5.Other commercial software uses 16-bit Unicode
The Oracle Call Interface (OCI) supports 16-bit Unicode (see
Oracle8i National Language Support Guide, Release 8.1.5 and
higher); in the database, UTF-8 is used. PeopleSoft PeopleTools 8
(C/C++ core) uses 16-bit Unicode, independently of the
platform. Also, some office products support 16-bit Unicode.
6.Operations and representation of character strings
Although UTF-32 makes some operations on characters easier
(e.g. indexing into strings), this implementation leads to a great
overhead in other areas (e.g. searching, collating and displaying,
where the whole string is involved). The final point is that even
UTF-16 without surrogates is already capable of supporting the most
important scripts like Arabic, Bengali etc.
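To illustrate point 3 (Interplatform Communication): an Endian conversion
of UTF-16 data can indeed be done in place, unit by unit, without
allocating a second buffer. The following is only a minimal sketch; the
function name swap_utf16_byte_order is ours, and the code units are
assumed to be held in uint16_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of each of the len 16-bit units in s, in place.
 * Surrogate pairs need no special treatment: each unit is swapped
 * on its own and the order of the units is unchanged. */
void swap_utf16_byte_order(uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
}
```

Applying the function twice restores the original data.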
For a full - and more technical - proposal please have a look at the
attachment.
Hoping for a fruitful discussion ;-)
Kind regards
Willi Nuesser
SAP LinuxLab
PS: When UTF-8 is used, the complexity of variable width characters
shows up with almost every commonly used language except pure 7-Bit
ASCII. For a number of languages, the UTF-8 representation saves some
storage when compared with UTF-16, but for Asian characters UTF-8
requires 50% more storage than UTF-16. We do not consider UTF-8
advantageous for text representation in memory. It may be well
suited for files where access is sequential, but in general it is no
universal solution.
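The storage comparison in this PS can be made concrete with a small
sketch. The helper names utf8_bytes and utf16_bytes are ours; they take a
Unicode scalar value and return the number of bytes its encoding needs.
For simplicity the sketch does not reject surrogate code values.

```c
#include <stdint.h>

/* Bytes needed to encode cp in UTF-8 (0 if cp is out of range). */
unsigned utf8_bytes(uint32_t cp)
{
    if (cp < 0x80u)      return 1;  /* 7-bit ASCII */
    if (cp < 0x800u)     return 2;  /* e.g. Latin, Greek, Cyrillic, Arabic */
    if (cp < 0x10000u)   return 3;  /* rest of the first 64K, e.g. CJK */
    if (cp <= 0x10FFFFu) return 4;
    return 0;
}

/* Bytes needed to encode cp in UTF-16 (0 if cp is out of range). */
unsigned utf16_bytes(uint32_t cp)
{
    if (cp < 0x10000u)   return 2;  /* single 16-bit unit */
    if (cp <= 0x10FFFFu) return 4;  /* surrogate pair */
    return 0;
}
```

For a CJK ideograph such as U+4E2D this gives 3 bytes in UTF-8 against
2 bytes in UTF-16, i.e. the 50% overhead mentioned above, while for
U+0041 UTF-8 needs 1 byte and UTF-16 needs 2.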
Appendix:
detailed proposal for 2 byte Unicode implementation
<<appendix.txt>>
Data types and library functions for UTF-16 and UTF-32
1.Data types for UTF-16 and UTF-32
We suggest the names utf16_t and utf32_t. Technically these types
are unsigned integers of 16 and 32 bits length, respectively.
Depending on the platform, either utf16_t or utf32_t coincides
with wchar_t. In the C++ Standard (ISO/IEC 14882:1998) wchar_t is
a keyword. For pure C usage it is sufficient if utf16_t and
utf32_t are defined by typedef, but in function and operator
overloading it would be impossible to distinguish pointers to
these new types from pointers to the corresponding integer type.
So it is desirable to have utf16_t and utf32_t as keywords,
possibly depending on a compiler switch for compatibility with the
standard.
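For the pure C case, the typedef variant could look like the following
sketch. It assumes a compiler that provides <stdint.h> with the exact-width
types and the C11 _Static_assert; neither is required by the proposal.
Because these are plain typedefs, utf16_t is here the very same type as
uint16_t, which is exactly why C++ overloading could not tell them apart.

```c
#include <limits.h>
#include <stdint.h>

/* Typedef-based definitions, sufficient for pure C usage.
 * These are aliases, not new types. */
typedef uint16_t utf16_t;
typedef uint32_t utf32_t;

/* Compile-time checks that the widths are platform-independent. */
_Static_assert(sizeof(utf16_t) * CHAR_BIT == 16, "utf16_t must have 16 bits");
_Static_assert(sizeof(utf32_t) * CHAR_BIT == 32, "utf32_t must have 32 bits");
```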
2.String and character literals
For utf16_t literals, we suggest the prefix u (similar to the
prefix L for the type wchar_t):
utf16_t s[] = u"someText";
utf16_t c = u's';
For utf32_t, we suggest the prefix U. This is similar to the
notation for universal character names in the C++ Standard: \u is
followed by four hexadecimal digits and \U is followed by eight
hexadecimal digits.
In C++, u's' will be of the type utf16_t, and U's' will be of the
type utf32_t. In C, character literals are (usually) of signed
integral type. For u's' and U's' we propose to introduce the types
utf16int_t and utf32int_t, respectively.
3.Runtime library
The C standard (ISO/IEC 9899:1990) and its Addendum 1 (1994)
specify a set of functions for the type wchar_t. Most of the
functions are declared in the header file wchar.h. The names of
the functions are similar to the names of the original functions
for type char, but with 'str' replaced by 'wcs' or a 'w' added in
the name.
These functions can also be found in "The Single UNIX
Specification, Version 2", Copyright 1997 by The Open Group, and
in the C++ Standard (ISO/IEC 14882:1998, Section 21.4).
We suggest implementing these functions for the type utf16_t,
using the suffix 'U16' in the name, and for utf32_t using the
suffix 'U32'. For conversion functions, there is an exception
to this rule: we replace 'wc' by 'U16' or 'U32',
respectively. This also makes it possible to introduce U16stoU32s etc.
1.Simple string handling
utf16_t *strcatU16 (utf16_t *, const utf16_t *);
utf16_t *strncatU16 (utf16_t *, const utf16_t *, size_t);
utf16_t *strcpyU16 (utf16_t *, const utf16_t *);
utf16_t *strncpyU16 (utf16_t *, const utf16_t *, size_t);
utf16_t *strdupU16 (const utf16_t *);
utf16_t *strchrU16 (const utf16_t *, utf16_t);
utf16_t *strrchrU16 (const utf16_t *, utf16_t);
size_t strlenU16 (const utf16_t *);
size_t strspnU16 (const utf16_t *, const utf16_t *);
size_t strcspnU16 (const utf16_t *, const utf16_t *);
int strcmpU16 (const utf16_t *, const utf16_t *);
int strncmpU16 (const utf16_t *,
const utf16_t *, size_t);
utf16_t *strpbrkU16 (const utf16_t *, const utf16_t *);
utf16_t *strstrU16 (const utf16_t *, const utf16_t *);
utf16_t *strtokU16 (utf16_t *, const utf16_t *,
utf16_t **);
utf32_t *strcatU32 (utf32_t *, const utf32_t *);
utf32_t *strncatU32 (utf32_t *, const utf32_t *, size_t);
utf32_t *strcpyU32 (utf32_t *, const utf32_t *);
utf32_t *strncpyU32 (utf32_t *, const utf32_t *, size_t);
utf32_t *strdupU32 (const utf32_t *);
utf32_t *strchrU32 (const utf32_t *, utf32_t);
utf32_t *strrchrU32 (const utf32_t *, utf32_t);
size_t strlenU32 (const utf32_t *);
size_t strspnU32 (const utf32_t *, const utf32_t *);
size_t strcspnU32 (const utf32_t *, const utf32_t *);
int strcmpU32 (const utf32_t *, const utf32_t *);
int strncmpU32 (const utf32_t *, const utf32_t *,
size_t);
utf32_t *strpbrkU32 (const utf32_t *, const utf32_t *);
utf32_t *strstrU32 (const utf32_t *, const utf32_t *);
utf32_t *strtokU32 (utf32_t *, const utf32_t *,
utf32_t **);
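As an illustration of how little code the simple string functions need,
here is a possible sketch of strlenU16 and strcmpU16. The typedef is an
assumed definition of utf16_t, and the bodies are ours, not taken from the
patch. Both functions work on 16-bit units: the length counts units, not
characters, and the comparison orders by unit value, which can differ from
code point order once surrogate pairs are involved.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* assumed definition for this sketch */

/* Number of utf16_t units before the terminating zero unit. */
size_t strlenU16(const utf16_t *s)
{
    const utf16_t *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Unit-wise comparison; returns <0, 0 or >0 like strcmp. */
int strcmpU16(const utf16_t *a, const utf16_t *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}
```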
2.Memory operations
utf16_t *memchrU16 (const utf16_t *s, utf16_t uc,
size_t len);
int memcmpU16 (const utf16_t *s1, const utf16_t *s2,
size_t len);
utf16_t *memcpyU16 (utf16_t *s1, const utf16_t *s2,
size_t len);
utf16_t *memmoveU16 (utf16_t *s1, const utf16_t *s2,
size_t len);
utf16_t *memsetU16 (utf16_t *s, utf16_t uc,
size_t len);
utf32_t *memchrU32 (const utf32_t *s, utf32_t uc,
size_t len);
int memcmpU32 (const utf32_t *s1, const utf32_t *s2,
size_t len);
utf32_t *memcpyU32 (utf32_t *s1, const utf32_t *s2,
size_t len);
utf32_t *memmoveU32 (utf32_t *s1, const utf32_t *s2,
size_t len);
utf32_t *memsetU32 (utf32_t *s, utf32_t uc,
size_t len);
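The memory operations can be sketched in the same way. We assume here
that len counts utf16_t units rather than bytes (as wmemset and wmemchr
do for wchar_t); the prototypes above do not state this explicitly. The
typedef is again an assumed definition.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* assumed definition for this sketch */

/* Fill the first len units of s with uc; returns s. */
utf16_t *memsetU16(utf16_t *s, utf16_t uc, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = uc;
    return s;
}

/* Find the first unit equal to uc among the first len units of s;
 * returns a pointer to it, or NULL if there is none. */
utf16_t *memchrU16(const utf16_t *s, utf16_t uc, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] == uc)
            return (utf16_t *)(s + i);  /* cast mirrors memchr's interface */
    return NULL;
}
```

Note that the byte-oriented memset cannot replace memsetU16 in general,
since it can only produce units whose two bytes are equal.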
3.Character classification
int isalnumU16 (utf16int_t);
int isalphaU16 (utf16int_t);
int iscntrlU16 (utf16int_t);
int isdigitU16 (utf16int_t);
int isgraphU16 (utf16int_t);
int islowerU16 (utf16int_t);
int isprintU16 (utf16int_t);
int ispunctU16 (utf16int_t);
int isspaceU16 (utf16int_t);
int isupperU16 (utf16int_t);
int isxdigitU16(utf16int_t);
U16type_t U16type (const char *);
int isU16type (utf16int_t, U16type_t);
int isalnumU32 (utf32int_t);
int isalphaU32 (utf32int_t);
int iscntrlU32 (utf32int_t);
int isdigitU32 (utf32int_t);
int isgraphU32 (utf32int_t);
int islowerU32 (utf32int_t);
int isprintU32 (utf32int_t);
int ispunctU32 (utf32int_t);
int isspaceU32 (utf32int_t);
int isupperU32 (utf32int_t);
int isxdigitU32(utf32int_t);
U32type_t U32type (const char *);
int isU32type (utf32int_t, U32type_t);
4.Case conversion, case-insensitive comparison
utf16int_t toupperU16 (utf16int_t);
utf16int_t tolowerU16 (utf16int_t);
int strcasecmpU16 (const utf16_t *, const utf16_t *);
int strncasecmpU16(const utf16_t *, const utf16_t *,
size_t n);
utf32int_t toupperU32 (utf32int_t);
utf32int_t tolowerU32 (utf32int_t);
int strcasecmpU32 (const utf32_t *, const utf32_t *);
int strncasecmpU32(const utf32_t *, const utf32_t *,
size_t n);
5.Collation
int strcollU16 (const utf16_t *, const utf16_t *);
size_t strxfrmU16 (utf16_t *, const utf16_t *, size_t);
int strcollU32 (const utf32_t *, const utf32_t *);
size_t strxfrmU32 (utf32_t *, const utf32_t *, size_t);
6.Conversions between different representations
int U16len (const utf16_t *, size_t);
int mbtoU16 (utf16_t *, const char *, size_t);
size_t mbrtoU16 (utf16_t *, const char *, size_t,
mbstate_t *);
int U16tomb (char *, const utf16_t *, size_t);
size_t U16rtomb (char *, const utf16_t *, size_t,
mbstate_t *);
size_t mbstoU16s (utf16_t *, const char *, size_t);
size_t mbsrtoU16s (utf16_t *, const char **, size_t,
mbstate_t *);
size_t mbsnrtoU16s(utf16_t *, const char **, size_t, size_t,
mbstate_t *);
size_t U16stombs (char *, const utf16_t *, size_t);
size_t U16srtombs (char *, const utf16_t **, size_t,
mbstate_t *);
size_t U16snrtombs(char *, const utf16_t **, size_t, size_t,
mbstate_t *);
int mbtoU32 (utf32_t *, const char *, size_t);
size_t mbrtoU32 (utf32_t *, const char *, size_t,
mbstate_t *);
int U32tomb (char *, utf32_t, size_t);
size_t U32rtomb (char *, utf32_t, size_t, mbstate_t *);
size_t mbstoU32s (utf32_t *, const char *, size_t);
size_t mbsrtoU32s (utf32_t *, const char **, size_t,
mbstate_t *);
size_t mbsnrtoU32s(utf32_t *, const char **, size_t,
size_t, mbstate_t *);
size_t U32stombs (char *, const utf32_t *, size_t);
size_t U32srtombs (char *, const utf32_t **, size_t,
mbstate_t *);
size_t U32snrtombs(char *, const utf32_t **, size_t, size_t,
mbstate_t *);
int U32toU16 (utf16_t *, utf32_t);
int U16toU32 (utf32_t *, const utf16_t *, size_t);
size_t U32stoU16s (utf16_t *, const utf32_t *, size_t);
size_t U32sntoU16s(utf16_t *, const utf32_t **, size_t,
size_t);
size_t U16stoU32s (utf32_t *, const utf16_t *, size_t);
size_t U16sntoU32s(utf32_t *, const utf16_t **, size_t,
size_t);
7.Conversion to numeric types
double strtodU16 (const utf16_t *, utf16_t **);
long int strtolU16 (const utf16_t *, utf16_t **, int);
unsigned long int strtoulU16 (const utf16_t *, utf16_t **, int);
long long int strtollU16 (const utf16_t *, utf16_t **, int);
unsigned long long int strtoullU16 (const utf16_t *, utf16_t **,
int);
double strtodU32 (const utf32_t *, utf32_t **);
long int strtolU32 (const utf32_t *, utf32_t **, int);
unsigned long int strtoulU32 (const utf32_t *, utf32_t **, int);
long long int strtollU32 (const utf32_t *, utf32_t **, int);
unsigned long long int strtoullU32 (const utf32_t *, utf32_t **,
int);
8.Standard I/O
int printfU16 (const utf16_t *uformat, ...);
int fprintfU16 (FILE *stream, const utf16_t *uformat, ...);
int sprintfU16 (utf16_t *str, const utf16_t *uformat, ...);
int vprintfU16 (const utf16_t *uformat, va_list ap);
int vfprintfU16 (FILE *stream, const utf16_t *uformat,
va_list ap);
int vsprintfU16 (utf16_t *str, const utf16_t *uformat,
va_list ap);
int scanfU16 (const utf16_t *uformat, ...);
int fscanfU16 (FILE *stream, const utf16_t *uformat, ...);
int sscanfU16 (const utf16_t *str,
const utf16_t *uformat, ...);
utf16int_t fgetcU16 (FILE *stream);
utf16int_t getcU16 (FILE *stream);
utf16int_t getcharU16 (void);
utf16int_t ungetcU16 (utf16int_t c, FILE *stream);
utf16int_t fputcU16 (utf16int_t c, FILE *stream);
utf16int_t putcU16 (utf16int_t c, FILE *stream);
utf16int_t putcharU16 (utf16int_t c);
utf16_t *fgetsU16 (utf16_t *str, int n, FILE *stream);
int fputsU16 (const utf16_t *str, FILE *stream);
int printfU32 (const utf32_t *uformat, ...);
int fprintfU32 (FILE *stream, const utf32_t *uformat, ...);
int sprintfU32 (utf32_t *str, const utf32_t *uformat, ...);
int vprintfU32 (const utf32_t *uformat, va_list ap);
int vfprintfU32 (FILE *stream, const utf32_t *uformat,
va_list ap);
int vsprintfU32 (utf32_t *str, const utf32_t *uformat,
va_list ap);
int scanfU32 (const utf32_t *uformat, ...);
int fscanfU32 (FILE *stream, const utf32_t *uformat, ...);
int sscanfU32 (const utf32_t *str, const utf32_t *uformat,
...);
utf32int_t fgetcU32 (FILE *stream);
utf32int_t getcU32 (FILE *stream);
utf32int_t getcharU32 (void);
utf32int_t ungetcU32 (utf32int_t c, FILE *stream);
utf32int_t fputcU32 (utf32int_t c, FILE *stream);
utf32int_t putcU32 (utf32int_t c, FILE *stream);
utf32int_t putcharU32 (utf32int_t c);
utf32_t *fgetsU32 (utf32_t *str, int n, FILE *stream);
int fputsU32 (const utf32_t *str, FILE *stream);
9.Others
int U16swidth (const utf16_t *, size_t);
int U16width (utf16int_t);
size_t strftimeU16(utf16_t *str, size_t len,
const utf16_t *format,
const struct tm *tmdate);
int U32swidth (const utf32_t *, size_t);
int U32width (utf32int_t);
size_t strftimeU32(utf32_t *str, size_t len,
const utf32_t *format,
const struct tm *tmdate);
4.Remarks
The function
int U16len(const utf16_t *s, size_t n);
determines the number of the utf16_t units that constitute the
character pointed to by s if s is not a null pointer. At most n
units of the array pointed to by s will be examined.
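A possible implementation of U16len is sketched below. The return values
for the special cases are our assumptions, modelled on mblen: 0 for a
null pointer or a zero unit, and -1 if the units examined do not form a
complete, well-formed character. The typedef is an assumed definition.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* assumed definition for this sketch */

int U16len(const utf16_t *s, size_t n)
{
    if (s == NULL)
        return 0;                       /* assumption: as mblen(NULL, n) */
    if (n == 0)
        return -1;                      /* nothing may be examined */
    if (s[0] == 0)
        return 0;                       /* terminating zero unit */
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {
        /* high surrogate: a low surrogate must follow within n units */
        if (n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            return 2;
        return -1;
    }
    if (s[0] >= 0xDC00 && s[0] <= 0xDFFF)
        return -1;                      /* low surrogate without a high one */
    return 1;                           /* ordinary single-unit character */
}
```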
The function
int U32toU16(utf16_t *s, utf32_t wc);
determines the number of utf16_t units needed to represent the
character wc. It stores the utf16_t representation in the array
pointed to by s (if s is not a null pointer).
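The conversion described here might be implemented as in the following
sketch. Returning -1 for values that are not Unicode scalar values
(surrogate code values and anything above 0x10FFFF) is our assumption;
the typedefs are assumed definitions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* assumed definitions for this sketch */
typedef uint32_t utf32_t;

int U32toU16(utf16_t *s, utf32_t wc)
{
    if ((wc >= 0xD800u && wc <= 0xDFFFu) || wc > 0x10FFFFu)
        return -1;                      /* assumption: not representable */
    if (wc < 0x10000u) {
        if (s != NULL)
            s[0] = (utf16_t)wc;         /* one unit, value unchanged */
        return 1;
    }
    wc -= 0x10000u;                     /* 20 bits remain */
    if (s != NULL) {
        s[0] = (utf16_t)(0xD800u + (wc >> 10));     /* upper 10 bits */
        s[1] = (utf16_t)(0xDC00u + (wc & 0x3FFu));  /* lower 10 bits */
    }
    return 2;
}
```

Calling the function with a null pointer for s only reports how many
units would be needed, so a caller can size a buffer first.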
The function
int U16toU32(utf32_t *pwc, const utf16_t *s, size_t n);
determines the number of the utf16_t units that constitute the
character pointed to by s if s is not a null pointer. At most n
units of the array pointed to by s will be examined. The
corresponding utf32_t representation is stored in the object
pointed to by pwc.
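A sketch of the reverse direction follows. As with mbtowc, we assume a
return value of 0 for a null pointer or a zero unit and -1 for a
truncated or malformed sequence; these conventions and the typedefs are
assumptions of the sketch, not part of the text above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;  /* assumed definitions for this sketch */
typedef uint32_t utf32_t;

int U16toU32(utf32_t *pwc, const utf16_t *s, size_t n)
{
    if (s == NULL)
        return 0;                       /* assumption: as mbtowc */
    if (n == 0)
        return -1;                      /* nothing may be examined */
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {
        /* high surrogate: combine with the following low surrogate */
        if (n < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
            return -1;
        if (pwc != NULL)
            *pwc = 0x10000u
                 + (((utf32_t)s[0] - 0xD800u) << 10)
                 + ((utf32_t)s[1] - 0xDC00u);
        return 2;
    }
    if (s[0] >= 0xDC00 && s[0] <= 0xDFFF)
        return -1;                      /* low surrogate without a high one */
    if (pwc != NULL)
        *pwc = s[0];
    return s[0] == 0 ? 0 : 1;
}
```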