McEs, A Hacker Life: Static Unicode to Utf-8 Converter!

Wednesday, July 28, 2004

Static Unicode to Utf-8 Converter!

I was frustrated at 5 in the morning and couldn't sleep.

So I decided to do something exciting. I've seen all over the place, in Pango and other libraries that use UTF-8 as their internal encoding, that the UTF-8 bytes of a character have to be hardcoded. For example, from the Pango sources:

  /* First try using a specific ellipsis character in the best matching font
   */
  if (state->ellipsis_is_cjk)
    ellipsis_text = "\342\213\257";     /* U+22EF: MIDLINE HORIZONTAL ELLIPSIS, used for CJK */
  else
    ellipsis_text = "\342\200\246";     /* U+2026: HORIZONTAL ELLIPSIS */

So I decided to write a macro to do the conversion, such that you give it 0x22EF and it gives back "\342\213\257". Since nobody has done that before, I knew it was not going to be easy. Having been busy mastering the C preprocessor these days, I knew that it might even be impossible...

I started thinking. First: the only way to create a string in the preprocessor is the stringize operator #x, which is of no help here. So I decided that what I was going to form is an array. After playing a bit I found that (char *) {0342, 0213, 0257, 0} is almost as good as what I want and, if nowhere else, it can at least be used to initialize string buffers. Then I can say:

  char ellipsis_text[] = UNICHAR_TO_UTF8(0x22EF);

That's pretty much what I did, except that you can't put this initializer inside parentheses, as it's not an expression. So you cannot use (x?y:z) to choose between initializers, which means no control over the size of the initializer. So I had to stick with 7 octets. Here is what came out:

#ifndef _STATIC_UTF8_LONG_H
#define _STATIC_UTF8_LONG_H

#define UNICHAR_TO_UTF8(Char)                                                 \
  (const char [])                                                             \
    {                                                                         \
      /* first octet */                                                       \
      (Char) < 0x00000080 ?   (Char)                       :                  \
      (Char) < 0x00000800 ?  ((Char) >>  6)         | 0xC0 :                  \
      (Char) < 0x00010000 ?  ((Char) >> 12)         | 0xE0 :                  \
      (Char) < 0x00200000 ?  ((Char) >> 18)         | 0xF0 :                  \
      (Char) < 0x04000000 ?  ((Char) >> 24)         | 0xF8 :                  \
                             ((Char) >> 30)         | 0xFC,                   \
      /* second octet */                                                      \
      (Char) < 0x00000080 ?    0 /* null-terminator */     :                  \
      (Char) < 0x00000800 ?  ((Char)        & 0x3F) | 0x80 :                  \
      (Char) < 0x00010000 ? (((Char) >>  6) & 0x3F) | 0x80 :                  \
      (Char) < 0x00200000 ? (((Char) >> 12) & 0x3F) | 0x80 :                  \
      (Char) < 0x04000000 ? (((Char) >> 18) & 0x3F) | 0x80 :                  \
                            (((Char) >> 24) & 0x3F) | 0x80,                   \
      /* third octet */                                                       \
      (Char) < 0x00000800 ?    0 /* null-terminator */     :                  \
      (Char) < 0x00010000 ?  ((Char)        & 0x3F) | 0x80 :                  \
      (Char) < 0x00200000 ? (((Char) >>  6) & 0x3F) | 0x80 :                  \
      (Char) < 0x04000000 ? (((Char) >> 12) & 0x3F) | 0x80 :                  \
                            (((Char) >> 18) & 0x3F) | 0x80,                   \
      /* fourth octet */                                                      \
      (Char) < 0x00010000 ?    0 /* null-terminator */     :                  \
      (Char) < 0x00200000 ?  ((Char)        & 0x3F) | 0x80 :                  \
      (Char) < 0x04000000 ? (((Char) >>  6) & 0x3F) | 0x80 :                  \
                            (((Char) >> 12) & 0x3F) | 0x80,                   \
      /* fifth octet */                                                       \
      (Char) < 0x00200000 ?    0 /* null-terminator */     :                  \
      (Char) < 0x04000000 ?  ((Char)        & 0x3F) | 0x80 :                  \
                            (((Char) >>  6) & 0x3F) | 0x80,                   \
      /* sixth octet */                                                       \
      (Char) < 0x04000000 ?    0 /* null-terminator */     :                  \
                             ((Char)        & 0x3F) | 0x80,                   \
                               0 /* null-terminator */                        \
    }

#endif /* !_STATIC_UTF8_LONG_H */

and the code to use it:

#include <stdio.h>
#include "static-utf8-long.h"

int
main()
{
  printf ("%s\n", UNICHAR_TO_UTF8 (0x06CC));

  return 0;
}

But as you should have guessed (leave here otherwise ;), this way every single UTF-8 character consumes exactly 7 bytes, which is way too much. So I was not satisfied. I checked the assembly output of gcc under different optimization options and, no wonder, none of them kicked the trailing zero bytes out. So I needed to continue. Good, it was only 6 in the morning by now. For sure I needed to use preprocessor conditionals. But then, in preprocessor conditionals you can only use preprocessor symbols. The rest is obvious now:

#ifndef Char
# error Char undefined
#else
  (const char [])
    {
#if    Char  >= 0x00000000
      /* first octet */
      (Char) < 0x00000080 ?   (Char)                       :
      (Char) < 0x00000800 ?  ((Char) >>  6)         | 0xC0 :
      (Char) < 0x00010000 ?  ((Char) >> 12)         | 0xE0 :
      (Char) < 0x00200000 ?  ((Char) >> 18)         | 0xF0 :
      (Char) < 0x04000000 ?  ((Char) >> 24)         | 0xF8 :
                             ((Char) >> 30)         | 0xFC,
#endif
#if    Char >= 0x00000080
      /* second octet */
      (Char) < 0x00000800 ?  ((Char)        & 0x3F) | 0x80 :
      (Char) < 0x00010000 ? (((Char) >>  6) & 0x3F) | 0x80 :
      (Char) < 0x00200000 ? (((Char) >> 12) & 0x3F) | 0x80 :
      (Char) < 0x04000000 ? (((Char) >> 18) & 0x3F) | 0x80 :
                            (((Char) >> 24) & 0x3F) | 0x80,
#endif
#if    Char >= 0x00000800
      /* third octet */
      (Char) < 0x00010000 ?  ((Char)        & 0x3F) | 0x80 :
      (Char) < 0x00200000 ? (((Char) >>  6) & 0x3F) | 0x80 :
      (Char) < 0x04000000 ? (((Char) >> 12) & 0x3F) | 0x80 :
                            (((Char) >> 18) & 0x3F) | 0x80,
#endif
#if    Char >= 0x00010000
      /* fourth octet */
      (Char) < 0x00200000 ?  ((Char)        & 0x3F) | 0x80 :
      (Char) < 0x04000000 ? (((Char) >>  6) & 0x3F) | 0x80 :
                            (((Char) >> 12) & 0x3F) | 0x80,
#endif
#if    Char >= 0x00200000
      /* fifth octet */
      (Char) < 0x04000000 ?  ((Char)        & 0x3F) | 0x80 :
                            (((Char) >>  6) & 0x3F) | 0x80,
#endif
#if    Char >= 0x04000000
      /* sixth octet */
                             ((Char)        & 0x3F) | 0x80,
#endif
                            0 /* null-terminator */
    }
#undef Char
#endif

and the code to use it:

#include <stdio.h>

int
main()
{
  printf ("%s\n",
#           define Char 0x06CC
#           include "static-utf8-short.h"
         );

  return 0;
}

I'm not quite sure that I can find anybody to actually use this. At least I'm sure it will not show up in Glib. Yet another reason to start my own Unicode library ;-). By the way, lemme know if you have a better (read: real) solution.

[I said that we need a gallery of codecs on UTF-8 Project.]

¶ 6:35 AM

Comments:

Someone suggested that an inline function would do that and gcc optimizes enough to get what I want. I'm afraid it's not the case. The only way it can be done in a function is to malloc() the space and fill it byte by byte. You're not saying that the malloc()ed memory is going to be optimized away, are you?

# posted by

behdad : July 29, 2004 9:53 PM

I can't get (const char *){0342, 0213, 0257, 0} to work as an initializer in my gcc. The compiler apparently casts the first integer in the array to a pointer and gives a few warnings. I think you meant (const char []){...} which is also what you use in the code.

I tried using (const char*)(const char[]){...} and it works fine in my gcc. It can be used in an expression, allowing you to use a different array length for each unicode character. The only drawback is that it can't be used to initialize an array. The user has to strcpy it instead.

# posted by

Anonymous : April 29, 2005 5:00 AM
