
Proposal for 2 Byte Unicode implementation in gcc and glibc



Hi everybody

In the following we present a proposal for 2-byte Unicode support in
gcc and glibc.
 
 
Motivation for our proposal: 
 
When applications using 4-byte Unicode on Linux are compared with
similar applications using 2-byte Unicode on other platforms
(e.g. Win...), Linux will always show the worse performance.

Even otherwise superior OS performance cannot compensate for the
additional requirements in memory bandwidth, CPU, disk space etc.  One
simple example: for a typical database of about 100 GB, as used in
medium-sized companies, we find a ratio of about 70 percent strings to
30 percent other data. The transition to 2-byte Unicode would increase
the disk space to (2*70 + 30) % = 170 % of the original size. With
4-byte Unicode the same database would grow to (4*70 + 30) % = 310 %.
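
The arithmetic behind these figures can be checked with a small sketch.
The helper name relative_size_percent is ours, and the sketch assumes,
as the example above implicitly does, that every character in the
original database occupies exactly one byte.

```c
#include <assert.h>

/* Size of the converted database in percent of the original size.
 * Assumption of this sketch: the original stores one byte per character.
 * string_percent: share of the original data that is strings (0..100)
 * bytes_per_char: width of one character in the new encoding */
unsigned relative_size_percent(unsigned string_percent,
                               unsigned bytes_per_char)
{
    assert(string_percent <= 100);
    /* Strings grow by the character width; other data is unchanged. */
    return bytes_per_char * string_percent + (100 - string_percent);
}
```

With a 70 percent string share, relative_size_percent(70, 2) gives 170
and relative_size_percent(70, 4) gives 310.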


If we want Linux to become a major and globally usable platform, we
strongly believe that we cannot sustain this inflation, i.e. we have to
provide an additional 2-byte implementation of Unicode.  At the very
least, programmers must be free to choose which way they want their
programs to work.  Otherwise Linux will be on the wrong track.
  
 
 
Please see the following text for some detailed information and 
the attachment for our full proposal: 
 
****************************************************************************
 
The next version of our business applications will be offered with a
Unicode option. The software is programmed with 2-byte Unicode
characters.  Because Unicode characters are 4 bytes wide in the Linux
world today, there has been some effort to add 2-byte Unicode strings to
gcc 2.95.2 and to implement the appropriate string handling routines.
We would appreciate it if gcc and glibc were enhanced along the lines of
our feature proposal.  Of course, we are willing to contribute to the
implementation of 2-byte Unicode. We already have a patch for gcc that
provides some 2-byte Unicode support.
 
Reasons for Using UTF-16 and UTF-32 
 
What is UTF-16? 
 
UTF-16 allows access to 63K characters as single Unicode 16-bit
units. It can access an additional 1M characters by a mechanism known as
surrogate pairs. Two ranges of Unicode code values are reserved for the
high (first) and low (second) values of these pairs.  Highs are from
0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0, there
are no assigned surrogate pairs. Since the most common characters have
already been encoded in the first 64K values, the characters requiring
surrogate pairs will be relatively rare. (Taken from Unicode FAQ
Copyright ©1998-2000 Unicode, Inc.)
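
The surrogate mechanism described above can be sketched as follows. The
function name encode_surrogate_pair is hypothetical, and utf16_t and
utf32_t are plain typedefs here, standing in for the types suggested in
the appendix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;   /* stand-in for the proposed 16-bit type */
typedef uint32_t utf32_t;   /* stand-in for the proposed 32-bit type */

/* Split a code point in the range 0x10000..0x10FFFF into a high (first)
 * and a low (second) surrogate. */
void encode_surrogate_pair(utf32_t cp, utf16_t *high, utf16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    assert(high != NULL && low != NULL);

    utf32_t v = cp - 0x10000;                 /* 20 bits remain */
    *high = (utf16_t)(0xD800 + (v >> 10));    /* upper 10 bits  */
    *low  = (utf16_t)(0xDC00 + (v & 0x3FF));  /* lower 10 bits  */
}
```

For example, the code point 0x1D11E is split into the pair 0xD834,
0xDD1E.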
 
What is UTF-32? 
 
All characters represented in UTF-16, both those represented with 16
bits and those with a surrogate pair, can be represented as a single
32-bit unit in UTF-32. This single unit corresponds to the Unicode
scalar value, which is the abstract number associated with a Unicode
character. UTF-32 is a subset of the encoding mechanism called UCS-4 in
ISO 10646. (Taken from Unicode FAQ Copyright ©1998-2000 Unicode, Inc.)
 
These are reasons to use UTF-16: 
 
    1.Performance
 
      The UTF-16 representation of textual data needs only half the
      amount of memory that a 32-bit representation would need, provided
      that surrogate pairs occur only seldom, which will be the
      case. Memory itself may be cheap, but the size of the data that
      has to be handled for each user in a multiuser environment is a
      critical performance issue.

 
    2.Portability 
 
      Software that uses wchar_t has restricted portability since
      wchar_t has 32 bits on some platforms but only 16 bits on others.
      A dedicated type for Unicode with platform-independent length
      makes it possible to write portable software.
 
    3.Interplatform Communication 
 
      When UTF-16 is used for communication between different platforms,
      merely an Endian conversion may be necessary, which can be done in
      place. A conversion between UTF-16 and UTF-32 is more costly.
      UTF-32 would imply unacceptably high data volumes when used for
      communication.
 
    4.Embedding in existing IT infrastructures 
 
      UTF-16 Unicode implementations integrate better into existing IT
      landscapes, where we already find products which use 16-bit
      Unicode. One example is the Java Native Interface (JNI).
      Using JNI, C functions can access the UTF-16 representation of
      Java strings. UTF-16 support in C is therefore
      desirable. (Furthermore, in JDK 1.2, classes such as
      java.io.OutputStreamWriter and java.io.InputStreamReader support
      the conversion of surrogate pairs to single UTF-8 characters and
      back.)
 
    5.Other commercial software uses 16-bit Unicode 
 
      The Oracle Call Interface (OCI) supports 16-bit Unicode (see
      Oracle8i National Language Support Guide, Release 8.1.5 and
      higher); in the data base, UTF-8 is used. PeopleSoft PeopleTools 8
      (C/C++ core) uses 16-bit Unicode, independently of the
      platform. Also, some office products support 16-bit Unicode.
      
    6.Operations and representation of character strings 
      
      Although UTF-32 makes some operations on characters easier
      (e.g. indexing into strings) this implementation leads to a great
      overhead in other areas (see searching, collating, displaying etc.
      where the whole string is involved).  The final point is that even
      UTF-16 without surrogates is already capable of supporting the most
      important scripts like Arabic, Bengali etc.
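
The in-place endian conversion mentioned under point 3 can be sketched
like this. The function name swap_bytes_utf16 is hypothetical, and
utf16_t is a plain typedef standing in for the proposed type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;   /* stand-in for the proposed 16-bit type */

/* Reverse the byte order of len UTF-16 units in place. No second buffer
 * is needed, and surrogate pairs need no special handling because each
 * 16-bit unit is swapped independently. */
void swap_bytes_utf16(utf16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        s[i] = (utf16_t)((s[i] << 8) | (s[i] >> 8));
}
```

Applying the function twice restores the original buffer, so the same
routine serves for both directions of the conversion.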



For a full - and more technical - proposal please have a look at the
attachment.

Hoping for a fruitful discussion ;-)


Kind regards

Willi Nuesser
SAP LinuxLab




PS: When UTF-8 is used, the complexity of variable width characters
shows up with almost every commonly used language except pure 7-Bit
ASCII. For a number of languages, the UTF-8 representation saves some
storage when compared with UTF-16, but for Asian characters UTF-8
requires 50% more storage than UTF-16. We do not consider UTF-8
advantageous for text representation in memory. It may be well
suited for files where access is sequential, but in general it is no
universal solution.
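
The storage comparison can be made concrete with a small sketch that
returns the number of bytes one code point occupies in each encoding.
The function names utf8_bytes and utf16_bytes are hypothetical, and
utf32_t is a plain typedef standing in for the proposed type.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t utf32_t;   /* stand-in for the proposed 32-bit type */

/* Number of bytes needed to store one code point in UTF-8. */
unsigned utf8_bytes(utf32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;   /* 7-bit ASCII                      */
    if (cp < 0x800)   return 2;   /* e.g. Latin with accents, Greek   */
    if (cp < 0x10000) return 3;   /* rest of the first 64K values     */
    return 4;                     /* values needing surrogate pairs   */
}

/* Number of bytes needed to store one code point in UTF-16. */
unsigned utf16_bytes(utf32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}
```

For the CJK ideograph 0x4E2D, utf8_bytes gives 3 and utf16_bytes gives
2, which is the 50% overhead mentioned above. For 'A' (0x41) the
relation is reversed: 1 byte against 2.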
 
 

Appendix: 

detailed proposal for 2 byte Unicode implementation

  <<appendix.txt>> 
Data types and library functions for UTF-16 and UTF-32
 
   1.Data types for UTF-16 and UTF-32 
 
      We suggest the names utf16_t and utf32_t. Technically these types
      are unsigned integers of 16 and 32 bits length, respectively.
 
      Depending on the platform, either utf16_t or utf32_t coincides
      with wchar_t. In the C++ Standard (ISO/IEC 14882:1998) wchar_t is
      a keyword. For pure C usage it is sufficient if utf16_t and
      utf32_t are defined by typedef, but in function and operator
      overloading it would be impossible to distinguish pointers to
      these new types from pointers to the corresponding integer type.
      So it is desirable to have utf16_t and utf32_t as keywords,
      possibly depending on a compiler switch for compatibility with the
      standard.
 
   2.String and character literals 
 
      For utf16_t literals, we suggest the prefix u (similar to the
      prefix L for the type wchar_t):
 
         utf16_t s[] = u"someText"; 
         utf16_t c = u's'; 
 
      For utf32_t, we suggest the prefix U. This is similar to the
      notation for universal character names in the C++ Standard: \u is
      followed by four hexadecimal digits and \U is followed by eight
      hexadecimal digits.
 
      In C++, u's' will be of the type utf16_t, and U's' will be of the
      type utf32_t. In C, character literals are (usually) of signed
      integral type. For u's' and U's' we propose to introduce the types
      utf16int_t and utf32int_t, respectively.
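
As a sketch of how such literals read in practice, the following uses
the u and U prefixes as they are accepted by compilers implementing the
later C11 standard, where they denote char16_t and char32_t. Here
utf16_t and utf32_t are plain typedefs of the corresponding integer
types, which is an assumption of this sketch and not the keyword
solution suggested above. The variable names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* In C11, char16_t and char32_t are the same types as uint_least16_t
 * and uint_least32_t, so these typedefs can hold u"" and U"" literals. */
typedef uint_least16_t utf16_t;
typedef uint_least32_t utf32_t;

const utf16_t text16[] = u"someText";   /* 8 units plus terminator */
const utf32_t text32[] = U"someText";
const utf16_t c16 = u's';
const utf32_t c32 = U's';

/* The array length counts code units, not bytes. */
static_assert(sizeof text16 / sizeof text16[0] == 9,
              "8 characters plus terminating zero unit");
```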
 
   3.Runtime library 
 

      The C standard (ISO/IEC 9899:1990) and its Addendum 1 (1994)
      specify a set of functions for the type wchar_t. Most of the
      functions are declared in the header file wchar.h. The names of
      the functions are similar to the names of the original functions
      for type char, but with 'str' replaced by 'wcs' or a 'w' added in
      the name.
 
      These functions can also be found in "The Single UNIX
      Specification, Version 2", Copyright 1997 by The Open Group, and
      in the C++ Standard (ISO/IEC 14882:1998, Section 21.4).
 
      We suggest implementing these functions for the type utf16_t,
      using the suffix 'U16' in the name, and for utf32_t using the
      suffix 'U32'.  For conversion functions, there is an exception
      to this rule: we replace 'wc' by 'U16' or 'U32',
      respectively. This also makes it possible to introduce U16stoU32s etc.
 
         1.Simple string handling
 
         utf16_t  *strcatU16  (utf16_t *, const utf16_t *); 
         utf16_t  *strncatU16 (utf16_t *, const utf16_t *, size_t); 
         utf16_t  *strcpyU16  (utf16_t *, const utf16_t *); 
         utf16_t  *strncpyU16 (utf16_t *, const utf16_t *, size_t); 
         utf16_t  *strdupU16  (const utf16_t *); 
         utf16_t  *strchrU16  (const utf16_t *, utf16_t); 
         utf16_t  *strrchrU16 (const utf16_t *, utf16_t); 
         size_t    strlenU16  (const utf16_t *); 
         size_t    strspnU16  (const utf16_t *, const utf16_t *); 
         size_t    strcspnU16 (const utf16_t *, const utf16_t *); 
         int       strcmpU16  (const utf16_t *, const utf16_t *); 
         int       strncmpU16 (const utf16_t *, 
                               const utf16_t *, size_t); 
         utf16_t  *strpbrkU16 (const utf16_t *, const utf16_t *); 
         utf16_t  *strstrU16  (const utf16_t *, const utf16_t *); 
         utf16_t  *strtokU16  (utf16_t *, const utf16_t *, 
                               utf16_t **); 
         utf32_t  *strcatU32  (utf32_t *, const utf32_t *); 
         utf32_t  *strncatU32 (utf32_t *, const utf32_t *, size_t); 
         utf32_t  *strcpyU32  (utf32_t *, const utf32_t *); 
         utf32_t  *strncpyU32 (utf32_t *, const utf32_t *, size_t); 
         utf32_t  *strdupU32  (const utf32_t *); 
         utf32_t  *strchrU32  (const utf32_t *, utf32_t); 
         utf32_t  *strrchrU32 (const utf32_t *, utf32_t); 
         size_t    strlenU32  (const utf32_t *); 
         size_t    strspnU32  (const utf32_t *, const utf32_t *); 
         size_t    strcspnU32 (const utf32_t *, const utf32_t *); 
         int       strcmpU32  (const utf32_t *, const utf32_t *); 
         int       strncmpU32 (const utf32_t *, const utf32_t *, 
                               size_t); 
         utf32_t  *strpbrkU32 (const utf32_t *, const utf32_t *); 
         utf32_t  *strstrU32  (const utf32_t *, const utf32_t *); 
         utf32_t  *strtokU32  (utf32_t *, const utf32_t *, 
                               utf32_t **); 
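
A minimal sketch of how three of these functions might look, written by
analogy with their char counterparts. It is not taken from our patch,
and utf16_t is a plain typedef standing in for the proposed type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;   /* stand-in for the proposed 16-bit type */

/* Length in utf16_t units, not counting the terminating zero unit. */
size_t strlenU16(const utf16_t *s)
{
    assert(s != NULL);
    const utf16_t *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Compare unit by unit; returns a value <0, 0 or >0 like strcmp. */
int strcmpU16(const utf16_t *a, const utf16_t *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Copy src including its terminating zero unit; returns dst. */
utf16_t *strcpyU16(utf16_t *dst, const utf16_t *src)
{
    assert(dst != NULL && src != NULL);
    utf16_t *d = dst;
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}
```

Note that this strcmpU16 orders strings by 16-bit unit value. When
surrogate pairs occur, that order can differ from the order of the
corresponding utf32_t values.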
 
         2.Memory operations  
 
         utf16_t  *memchrU16  (const utf16_t *s,  utf16_t uc, 
                               size_t len); 
         int       memcmpU16  (const utf16_t *s1, const utf16_t *s2,
                               size_t len);  
         utf16_t  *memcpyU16  (utf16_t *s1,       const utf16_t *s2, 
                               size_t len); 
         utf16_t  *memmoveU16 (utf16_t *s1,       const utf16_t *s2, 
                               size_t len); 
         utf16_t  *memsetU16  (utf16_t *s,        utf16_t uc,        
                               size_t len); 
         utf32_t  *memchrU32  (const utf32_t *s,  utf32_t uc,        
                               size_t len); 
         int       memcmpU32  (const utf32_t *s1, const utf32_t *s2, 
                               size_t len);  
         utf32_t  *memcpyU32  (utf32_t *s1,       const utf32_t *s2, 
                               size_t len); 
         utf32_t  *memmoveU32 (utf32_t *s1,       const utf32_t *s2, 
                               size_t len); 
         utf32_t  *memsetU32  (utf32_t *s,        utf32_t uc,        
                               size_t len); 
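
A minimal sketch of three of the memory operations. We assume here that
len counts utf16_t units and not bytes, by analogy with the wmem...
functions for wchar_t; the prototypes above do not state this
explicitly. utf16_t is a plain typedef standing in for the proposed
type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t utf16_t;   /* stand-in for the proposed 16-bit type */

/* Fill len units (not bytes) with the unit uc; returns s. */
utf16_t *memsetU16(utf16_t *s, utf16_t uc, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        s[i] = uc;
    return s;
}

/* Find the first unit equal to uc among the first len units. */
utf16_t *memchrU16(const utf16_t *s, utf16_t uc, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (s[i] == uc)
            return (utf16_t *)(s + i);
    return NULL;
}

/* Copy len units; the byte count is len times the unit size. */
utf16_t *memcpyU16(utf16_t *s1, const utf16_t *s2, size_t len)
{
    assert((s1 != NULL && s2 != NULL) || len == 0);
    if (len > 0)
        memcpy(s1, s2, len * sizeof *s1);
    return s1;
}
```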
 
         3.Character classification
 
         int isalnumU16 (utf16int_t); 
         int isalphaU16 (utf16int_t); 
         int iscntrlU16 (utf16int_t); 
         int isdigitU16 (utf16int_t); 
         int isgraphU16 (utf16int_t); 
         int islowerU16 (utf16int_t); 
         int isprintU16 (utf16int_t); 
         int ispunctU16 (utf16int_t); 
         int isspaceU16 (utf16int_t); 
         int isupperU16 (utf16int_t); 
         int isxdigitU16(utf16int_t); 
 
         U16type_t U16type   (const char *); 
         int       isU16type (utf16int_t, U16type_t);  
         int isalnumU32 (utf32int_t); 
         int isalphaU32 (utf32int_t); 
         int iscntrlU32 (utf32int_t); 
         int isdigitU32 (utf32int_t); 
         int isgraphU32 (utf32int_t); 
         int islowerU32 (utf32int_t); 
         int isprintU32 (utf32int_t); 
         int ispunctU32 (utf32int_t); 
         int isspaceU32 (utf32int_t); 
         int isupperU32 (utf32int_t); 
         int isxdigitU32(utf32int_t); 
 
         U32type_t U32type   (const char *); 
         int       isU32type (utf32int_t, U32type_t); 
 
         4.Case conversion, case-insensitive comparison 
 
         utf16int_t toupperU16    (utf16int_t); 
         utf16int_t tolowerU16    (utf16int_t); 
         int        strcasecmpU16 (const utf16_t *, const utf16_t *); 
         int        strncasecmpU16(const utf16_t *, const utf16_t *, 
                                   size_t n); 
         utf32int_t toupperU32    (utf32int_t); 
         utf32int_t tolowerU32    (utf32int_t); 
         int        strcasecmpU32 (const utf32_t *, const utf32_t *); 
         int        strncasecmpU32(const utf32_t *, const utf32_t *, 
                                   size_t n); 
 
         5.Collation 
 
         int       strcollU16 (const utf16_t *, const utf16_t *); 
         size_t    strxfrmU16 (utf16_t *, const utf16_t *, size_t); 
         int       strcollU32 (const utf32_t *, const utf32_t *); 
         size_t    strxfrmU32 (utf32_t *, const utf32_t *, size_t); 
 
         6.Conversions between different representations 
 
         int    U16len     (const utf16_t *, size_t); 
         int    mbtoU16    (utf16_t *, const char *, size_t); 
         size_t mbrtoU16   (utf16_t *, const char *, size_t,
                            mbstate_t *); 
         int    U16tomb    (char *, const utf16_t *, size_t); 
         size_t U16rtomb   (char *, const utf16_t *, size_t, 
                            mbstate_t *); 
         size_t mbstoU16s  (utf16_t *, const char *, size_t); 
         size_t mbsrtoU16s (utf16_t *, const char **, size_t, 
                            mbstate_t *); 
         size_t mbsnrtoU16s(utf16_t *, const char **, size_t, size_t,
                            mbstate_t *); 
         size_t U16stombs  (char *, const utf16_t *, size_t);         
         size_t U16srtombs (char *, const utf16_t **, size_t, 
                            mbstate_t *); 
         size_t U16snrtombs(char *, const utf16_t **, size_t, size_t,
                            mbstate_t *); 
         int    mbtoU32    (utf32_t *, const char *, size_t); 
         size_t mbrtoU32   (utf32_t *, const char *, size_t, 
                            mbstate_t *); 
         int    U32tomb    (char *, utf32_t, size_t); 
         size_t U32rtomb   (char *, utf32_t, size_t, mbstate_t *); 
         size_t mbstoU32s  (utf32_t *, const char *, size_t); 
         size_t mbsrtoU32s (utf32_t *, const char **, size_t, 
                            mbstate_t *); 
         size_t mbsnrtoU32s(utf32_t *, const char **, size_t,  
                            size_t, mbstate_t *); 
         size_t U32stombs  (char *, const utf32_t *, size_t);         
         size_t U32srtombs (char *, const utf32_t **, size_t, 
                            mbstate_t *); 
         size_t U32snrtombs(char *, const utf32_t **, size_t, size_t, 
                            mbstate_t *); 
         int    U32toU16   (utf16_t *, utf32_t); 
         int    U16toU32   (utf32_t *, const utf16_t *, size_t); 
         size_t U32stoU16s (utf16_t *, const utf32_t *, size_t); 
         size_t U32sntoU16s(utf16_t *, const utf32_t **, size_t, 
                            size_t); 
         size_t U16stoU32s (utf32_t *, const utf16_t *, size_t);         
         size_t U16sntoU32s(utf32_t *, const utf16_t **, size_t, 
                            size_t);         
 
 
         7.Conversion to numeric types 
 
         double            strtodU16  (const utf16_t *, utf16_t **); 
         long int          strtolU16  (const utf16_t *, utf16_t **, int); 
         unsigned long int strtoulU16 (const utf16_t *, utf16_t **, int); 
         long long int     strtollU16 (const utf16_t *, utf16_t **, int); 
         unsigned long long int strtoullU16 (const utf16_t *, utf16_t **,
                                             int); 
 
         double            strtodU32  (const utf32_t *, utf32_t **); 
         long int          strtolU32  (const utf32_t *, utf32_t **, int); 
         unsigned long int strtoulU32 (const utf32_t *, utf32_t **, int); 
         long long int     strtollU32 (const utf32_t *, utf32_t **, int); 
         unsigned long long int strtoullU32 (const utf32_t *, utf32_t **,
                                             int);
 
         8.Standard I/O 
 
         int     printfU16 (const utf16_t *uformat, ...); 
         int    fprintfU16 (FILE *stream, const utf16_t *uformat, ...); 
         int    sprintfU16 (utf16_t *str, const utf16_t *uformat, ...); 
         int    vprintfU16 (const utf16_t *uformat, va_list ap); 
         int   vfprintfU16 (FILE *stream, const utf16_t *uformat,
                            va_list ap);
         int   vsprintfU16 (utf16_t *str, const utf16_t *uformat,
                            va_list ap);
              
         int      scanfU16 (const utf16_t *uformat, ...);  
         int     fscanfU16 (FILE *stream, const utf16_t *uformat, ...); 
         int     sscanfU16 (const utf16_t *str, 
                            const utf16_t *uformat, ...);  
                
         utf16int_t   fgetcU16 (FILE *stream); 
         utf16int_t    getcU16 (FILE *stream); 
         utf16int_t getcharU16 (void); 
         utf16int_t  ungetcU16 (utf16int_t c, FILE *stream); 
         utf16int_t   fputcU16 (utf16int_t c, FILE *stream); 
         utf16int_t    putcU16 (utf16int_t c, FILE *stream); 
         utf16int_t putcharU16 (utf16int_t c); 
              
         utf16_t *fgetsU16 (utf16_t *str, int n, FILE *stream); 
         int      fputsU16 (const utf16_t *str, FILE *stream); 
 
         int     printfU32 (const utf32_t *uformat, ...); 
         int    fprintfU32 (FILE *stream, const utf32_t *uformat, ...); 
         int    sprintfU32 (utf32_t *str, const utf32_t *uformat, ...); 
         int    vprintfU32 (const utf32_t *uformat, va_list ap); 
         int   vfprintfU32 (FILE *stream, const utf32_t *uformat,
                            va_list ap);
         int   vsprintfU32 (utf32_t *str, const utf32_t *uformat,
                            va_list ap);
              
         int      scanfU32 (const utf32_t *uformat, ...);  
         int     fscanfU32 (FILE *stream, const utf32_t *uformat, ...); 
         int     sscanfU32 (const utf32_t *str, const utf32_t *uformat,
                            ...);
                
         utf32int_t   fgetcU32 (FILE *stream); 
         utf32int_t    getcU32 (FILE *stream); 
         utf32int_t getcharU32 (void); 
         utf32int_t  ungetcU32 (utf32int_t c, FILE *stream); 
         utf32int_t   fputcU32 (utf32int_t c, FILE *stream); 
         utf32int_t    putcU32 (utf32int_t c, FILE *stream); 
         utf32int_t putcharU32 (utf32int_t c); 
              
         utf32_t *fgetsU32 (utf32_t *str, int n, FILE *stream); 
         int      fputsU32 (const utf32_t *str, FILE *stream); 

         9.Others 
 
         int    U16swidth  (const utf16_t *, size_t); 
         int    U16width   (utf16int_t); 
         size_t strftimeU16(utf16_t *str, size_t len,  
                            const utf16_t *format, 
                            const struct tm *tmdate); 
 
         int    U32swidth  (const utf32_t *, size_t); 
         int    U32width   (utf32int_t); 
         size_t strftimeU32(utf32_t *str, size_t len,  
                            const utf32_t *format, 
                            const struct tm *tmdate); 
 
   4.Remarks 
 
      The function  
 
         int U16len(const utf16_t *s, size_t n); 
 
      determines the number of the utf16_t units that constitute the
      character pointed to by s if s is not a null pointer. At most n
      units of the array pointed to by s will be examined.
 
      The function  
 
         int U32toU16(utf16_t *s, utf32_t wc); 
 
      determines the number of utf16_t units needed to represent the
      character wc. It stores the utf16_t representation in the array
      pointed to by s (if s is not a null pointer).
 
      The function  
 
         int U16toU32(utf32_t *pwc, const utf16_t *s, size_t n); 
 
      determines the number of the utf16_t units that constitute the
      character pointed to by s if s is not a null pointer. At most n
      units of the array pointed to by s will be examined. The
      corresponding utf32_t representation is stored in the object
      pointed to by pwc.
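
The behaviour of U16len and U16toU32 described above can be sketched as
follows. The return values for the cases the remarks leave open are
assumptions of this sketch, chosen by analogy with mblen and mbtowc: 0
for a null pointer or a terminating zero unit, and -1 for an unpaired
surrogate or when n is too small. utf16_t and utf32_t are plain typedefs
standing in for the proposed types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_t;   /* stand-in for the proposed 16-bit type */
typedef uint32_t utf32_t;   /* stand-in for the proposed 32-bit type */

/* Number of utf16_t units forming the character at s; examines at most
 * n units. Assumed conventions: 0 for s == NULL or a zero unit, -1 for
 * an invalid or incomplete sequence. */
int U16len(const utf16_t *s, size_t n)
{
    if (s == NULL)
        return 0;
    if (n == 0)
        return -1;
    if (s[0] == 0)
        return 0;
    if (s[0] < 0xD800 || s[0] > 0xDFFF)
        return 1;                       /* ordinary single unit */
    if (s[0] <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
        return 2;                       /* high + low surrogate */
    return -1;                          /* unpaired or truncated */
}

/* Like U16len, and additionally stores the utf32_t value in *pwc
 * (if pwc is not a null pointer) when the sequence is valid. */
int U16toU32(utf32_t *pwc, const utf16_t *s, size_t n)
{
    int len = U16len(s, n);
    if (len < 0 || s == NULL)
        return len;
    if (pwc != NULL) {
        if (len == 2)
            *pwc = 0x10000 + (((utf32_t)(s[0] - 0xD800) << 10)
                              | (utf32_t)(s[1] - 0xDC00));
        else
            *pwc = s[0];                /* also covers the zero unit */
    }
    return len;
}
```

U32toU16 performs the inverse step: it returns 1 or 2 and writes either
the value itself or the high and low surrogate.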