Behdad Esfahbod's daily notes on GNOME, Pango, Fedora, Persian Computing, Bob Dylan, and Dan Bern!

McEs, A Hacker Life
Wednesday, July 28, 2004
 Static Unicode to Utf-8 Converter!

I was frustrated at 5 in the morning and couldn't sleep. This is what I looked like:

So I decided to do something exciting. All over the place in Pango, and in other libraries that use UTF-8 as their internal encoding, I've seen code that has to hardcode the UTF-8 bytes of a character. For example, from the Pango sources:

  /* First try using a specific ellipsis character in the best matching font */
  if (state->ellipsis_is_cjk)
    ellipsis_text = "\342\213\257"; /* U+22EF: MIDLINE HORIZONTAL ELLIPSIS, used for CJK */
  else
    ellipsis_text = "\342\200\246"; /* U+2026: HORIZONTAL ELLIPSIS */

So I decided to write a macro to do the conversion, such that you give it 0x22EF and it gives back "\342\213\257". Since nobody had done that before, I knew it was not going to be easy. Having been mastering the C preprocessor these days, I knew that it might be impossible...

I started thinking. First: the only way to create a string in the preprocessor is the stringizing operator #x, which is no help here. So I decided that what I was going to form is an array. After playing a bit I found that (char *) {0342, 0213, 0257, 0} is almost as good as what I want, and if nowhere else, it can at least be used to initialize string buffers. Then I can say:
  char ellipsis_text[] = UNICHAR_TO_UTF8(0x22EF);

That's pretty much what I did, except that you can't put this initializer inside parentheses, as it's not an expression. So you cannot wrap the whole thing in (x?y:z), which means no control over the size of the initializer. So I had to stick with 7 octets. Here is what came out:
#ifndef _STATIC_UTF8_LONG_H
#define _STATIC_UTF8_LONG_H

#define UNICHAR_TO_UTF8(Char) \
  (const char []) \
  { \
    /* first octet */ \
    (Char) < 0x00000080 ? (Char) : \
    (Char) < 0x00000800 ? ((Char) >> 6) | 0xC0 : \
    (Char) < 0x00010000 ? ((Char) >> 12) | 0xE0 : \
    (Char) < 0x00200000 ? ((Char) >> 18) | 0xF0 : \
    (Char) < 0x04000000 ? ((Char) >> 24) | 0xF8 : \
    ((Char) >> 30) | 0xFC, \
    /* second octet */ \
    (Char) < 0x00000080 ? 0 /* null-terminator */ : \
    (Char) < 0x00000800 ? ((Char) & 0x3F) | 0x80 : \
    (Char) < 0x00010000 ? (((Char) >> 6) & 0x3F) | 0x80 : \
    (Char) < 0x00200000 ? (((Char) >> 12) & 0x3F) | 0x80 : \
    (Char) < 0x04000000 ? (((Char) >> 18) & 0x3F) | 0x80 : \
    (((Char) >> 24) & 0x3F) | 0x80, \
    /* third octet */ \
    (Char) < 0x00000800 ? 0 /* null-terminator */ : \
    (Char) < 0x00010000 ? ((Char) & 0x3F) | 0x80 : \
    (Char) < 0x00200000 ? (((Char) >> 6) & 0x3F) | 0x80 : \
    (Char) < 0x04000000 ? (((Char) >> 12) & 0x3F) | 0x80 : \
    (((Char) >> 18) & 0x3F) | 0x80, \
    /* fourth octet */ \
    (Char) < 0x00010000 ? 0 /* null-terminator */ : \
    (Char) < 0x00200000 ? ((Char) & 0x3F) | 0x80 : \
    (Char) < 0x04000000 ? (((Char) >> 6) & 0x3F) | 0x80 : \
    (((Char) >> 12) & 0x3F) | 0x80, \
    /* fifth octet */ \
    (Char) < 0x00200000 ? 0 /* null-terminator */ : \
    (Char) < 0x04000000 ? ((Char) & 0x3F) | 0x80 : \
    (((Char) >> 6) & 0x3F) | 0x80, \
    /* sixth octet */ \
    (Char) < 0x04000000 ? 0 /* null-terminator */ : \
    ((Char) & 0x3F) | 0x80, \
    0 /* null-terminator */ \
  }

#endif /* !_STATIC_UTF8_LONG_H */

and the code to use it:
#include <stdio.h>
#include "static-utf8-long.h"

int
main()
{
  printf ("%s\n", UNICHAR_TO_UTF8 (0x06CC));

  return 0;
}


But as you should have guessed (leave here otherwise ;), this way every single UTF-8 character consumes exactly 7 bytes, which is way too much. So I was not satisfied. I checked the assembly output of gcc under different optimization options and, no wonder, none of them kicked the trailing zero bytes out. So I needed to continue. Good, it was only 6 in the morning by now. For sure I needed to use preprocessor conditionals. But then, in preprocessor conditionals you can only use preprocessor symbols, so the code point has to be #defined before the header is included. The rest is obvious now:
#ifndef Char
# error Char undefined
#else
(const char [])
{
#if Char >= 0x00000000
  /* first octet */
  (Char) < 0x00000080 ? (Char) :
  (Char) < 0x00000800 ? ((Char) >> 6) | 0xC0 :
  (Char) < 0x00010000 ? ((Char) >> 12) | 0xE0 :
  (Char) < 0x00200000 ? ((Char) >> 18) | 0xF0 :
  (Char) < 0x04000000 ? ((Char) >> 24) | 0xF8 :
  ((Char) >> 30) | 0xFC,
#endif
#if Char >= 0x00000080
  /* second octet */
  (Char) < 0x00000800 ? ((Char) & 0x3F) | 0x80 :
  (Char) < 0x00010000 ? (((Char) >> 6) & 0x3F) | 0x80 :
  (Char) < 0x00200000 ? (((Char) >> 12) & 0x3F) | 0x80 :
  (Char) < 0x04000000 ? (((Char) >> 18) & 0x3F) | 0x80 :
  (((Char) >> 24) & 0x3F) | 0x80,
#endif
#if Char >= 0x00000800
  /* third octet */
  (Char) < 0x00010000 ? ((Char) & 0x3F) | 0x80 :
  (Char) < 0x00200000 ? (((Char) >> 6) & 0x3F) | 0x80 :
  (Char) < 0x04000000 ? (((Char) >> 12) & 0x3F) | 0x80 :
  (((Char) >> 18) & 0x3F) | 0x80,
#endif
#if Char >= 0x00010000
  /* fourth octet */
  (Char) < 0x00200000 ? ((Char) & 0x3F) | 0x80 :
  (Char) < 0x04000000 ? (((Char) >> 6) & 0x3F) | 0x80 :
  (((Char) >> 12) & 0x3F) | 0x80,
#endif
#if Char >= 0x00200000
  /* fifth octet */
  (Char) < 0x04000000 ? ((Char) & 0x3F) | 0x80 :
  (((Char) >> 6) & 0x3F) | 0x80,
#endif
#if Char >= 0x04000000
  /* sixth octet */
  ((Char) & 0x3F) | 0x80,
#endif
  0 /* null-terminator */
}
#undef Char
#endif

and the code to use it:
#include <stdio.h>

int
main()
{
  printf ("%s\n",
# define Char 0x06CC
# include "static-utf8-short.h"
         );

  return 0;
}


I'm not quite sure that I can find anybody to actually use this. At least I'm sure it will not show up in GLib. Yet another reason to start my own Unicode library ;-). By the way, lemme know if you have a better (read: real) solution.

[I said that we need a gallery of codecs on UTF-8 Project.]

Comments:
 
Someone suggested that an inline function would do that and gcc optimizes enough to get what I want. I'm afraid it's not the case. The only way it can be done in a function is to malloc() the space and fill it byte by byte. You're not saying that the malloc()ed memory is going to be optimized away, are you?
 
I can't get (const char *){0342, 0213, 0257, 0} to work as an initializer in my gcc. The compiler apparently casts the first integer in the array to a pointer and gives a few warnings. I think you meant (const char []){...} which is also what you use in the code.

I tried using (const char*)(const char[]){...} and it works fine in my gcc. It can be used in an expression, allowing you to use a different array length for each unicode character. The only drawback is that it can't be used to initialize an array. The user has to strcpy it instead.
 