False alarm on g_utf8_offset_to_pointer
My investigations suggest that the current code is indeed the best to have in glib. Imagary first:
- glib is the original glib code that jumps over each character using a lookup table. As you see, the performance is kinda linear on the number of characters.
- pvanhoof is this post, checking for *s < 192 before looking up in the table
- luis is this implementation, it goes over all bytes, checking which ones are start of character. This doesn't work good for multibyte characters at all. As you can see, it's exactly three times slower than the original code for Korean.
- hpj is this implementation, using a very custom logic working a 32-bit word at a time. Again, it's pretty slow on Korean, since it's linear in the number of bytes, not characters.
- behdad is my implementation below. It works a word at a time, using bit hacking to count the number of start-of-char bytes. It's faster than the other proposed patches for most cases, but still slower than the original code. BUT, since it works a word at a time, I expect it to beat the original code on 64-bit architectures. :-) Here it is:
#define WORDTYPE guint
#define REPEAT(x) (((WORDTYPE)-1 / 0xFF) * (x))
gchar*
behdad_utf8_offset_to_pointer (const gchar *str, glong offset)
{
const WORDTYPE *ws;
const gchar *s;;
ws = (const WORDTYPE *)str;
while (offset >= sizeof (WORDTYPE)) {
register WORDTYPE w = *ws++;
w |= ~(w >> 1);
w &= REPEAT(0x40);
w >>= 6;
w *= REPEAT(1);
w >>= (sizeof (WORDTYPE) - 1) * 8;
offset -= w;
}
s = (const gchar *)ws;
while ((*(const guchar*)s)>>6==2)
s++;
while (offset--)
s = g_utf8_next_char (s);
return s;
}
(Update: code updated to rip off one more operation)
Surprisingly, unwrapping the loops slowed things down for all implementations! Should have something to do with pipelining. Makes some weird kind of sense to me.
The code for the benchmark is attached to this
post to mailing list.
Update: As Morten
shows, when dealing with words, one should align access, or it dumps core on Sparc and other weird architectures.