McEs, A Hacker Life: False alarm on g_utf8_offset_to

Thursday, November 03, 2005

False alarm on g_utf8_offset_to_pointer

My investigations suggest that the current code is indeed the best to have in glib. Imagery first:

glib is the original glib code that jumps over each character using a lookup table. As you can see, its running time is roughly linear in the number of characters.
pvanhoof is this post, checking for *s < 192 before looking up in the table.
luis is this implementation. It goes over all bytes, checking which ones are the start of a character. This doesn't work well for multibyte characters at all: as you can see, it's exactly three times slower than the original code for Korean.
hpj is this implementation, using very custom logic that works on a 32-bit word at a time. Again, it's pretty slow on Korean, since it's linear in the number of bytes, not characters.
behdad is my implementation below. It works a word at a time, using bit hacking to count the number of start-of-char bytes. It's faster than the other proposed patches for most cases, but still slower than the original code. BUT, since it works a word at a time, I expect it to beat the original code on 64-bit architectures. :-) Here it is:


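#define WORDTYPE guint
/* The byte x repeated in every byte of a word, e.g. 0x40404040 for 32 bits */
#define REPEAT(x) (((WORDTYPE)-1 / 0xFF) * (x))

gchar*
behdad_utf8_offset_to_pointer (const gchar *str, glong offset)
{
  const WORDTYPE *ws;
  const gchar *s;

  /* Word at a time: subtract the number of start-of-char bytes in each word */
  ws = (const WORDTYPE *)str;
  while (offset >= sizeof (WORDTYPE)) {
    register WORDTYPE w = *ws++;
    w |= ~(w >> 1);      /* bit 6 of each byte is set iff its top two bits are not 10 */
    w &= REPEAT(0x40);   /* keep only that bit */
    w >>= 6;             /* each byte is now 0 or 1 */
    w *= REPEAT(1);      /* the high byte now holds the sum of all bytes */
    w >>= (sizeof (WORDTYPE) - 1) * 8;
    offset -= w;
  }

  /* We may have stopped in the middle of a character: skip continuation bytes */
  s = (const gchar *)ws;
  while ((*(const guchar*)s)>>6==2)
    s++;
  /* Walk the remaining characters one at a time */
  while (offset--)
    s = g_utf8_next_char (s);

  return (gchar *)s;
}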

(Update: code updated to rip off one more operation)
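
To make the glib and luis entries in the list above concrete, here is a minimal sketch of the two loop shapes, with a step counter that shows where the factor of three for Korean comes from. This is not the code from glib or from luis's patch: seq_len, by_chars and by_bytes are hypothetical names, plain C types stand in for the glib ones, and seq_len assumes sequences of at most four bytes.

```c
#include <stddef.h>

/* Sequence length from the lead byte. Stand-in for glib's lookup table;
 * assumes sequences of at most four bytes. */
static size_t
seq_len (unsigned char c)
{
  if (c < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
  if (c < 0xE0) return 2;
  if (c < 0xF0) return 3;
  return 4;
}

/* glib-style: one step per character, jumping over its bytes. */
static const char *
by_chars (const char *s, long offset, size_t *steps)
{
  while (offset-- > 0)
    {
      s += seq_len ((unsigned char) *s);
      (*steps)++;
    }
  return s;
}

/* luis-style: one step per byte, counting a character each time
 * the next byte is not a continuation byte (10xxxxxx). */
static const char *
by_bytes (const char *s, long offset, size_t *steps)
{
  while (offset > 0)
    {
      s++;
      (*steps)++;
      if (((unsigned char) *s & 0xC0) != 0x80)
        offset--;
    }
  return s;
}
```

For four Korean characters (twelve bytes), by_chars takes 4 steps and by_bytes takes 12, and both end at the same pointer.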

Surprisingly, unrolling the loops slowed things down for all implementations! It probably has something to do with pipelining, which makes some weird kind of sense to me.

The code for the benchmark is attached to this post to the mailing list.

Update: As Morten shows, when dealing with words, one should align access, or it dumps core on Sparc and other weird architectures.
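
A minimal sketch of what aligning the access could look like, assuming 32-bit words and plain C types instead of glib's. The names (word_t, REPEAT_BYTE, is_start_byte, aligned_utf8_offset_to_pointer) are hypothetical, and this is neither Morten's fix nor anything that went into glib. A byte-wise prologue consumes start bytes until the pointer sits on a word boundary, and only then does the word loop take over. The load goes through memcpy so it stays well-defined C whatever the aliasing rules say. Like the original, it assumes offset is non-negative and within the string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t word_t;   /* assumption: 32-bit words */
#define REPEAT_BYTE(x) ((word_t)(((word_t)-1 / 0xFF) * (x)))

/* A byte starts a character unless its top two bits are 10. */
static int
is_start_byte (unsigned char c)
{
  return (c >> 6) != 2;
}

const char *
aligned_utf8_offset_to_pointer (const char *str, long offset)
{
  const unsigned char *s = (const unsigned char *) str;

  /* Prologue: single bytes until s is on a word boundary. */
  while (offset > 0 && ((uintptr_t) s % sizeof (word_t)) != 0)
    offset -= is_start_byte (*s++);

  /* Main loop: count the start bytes of one aligned word at a time. */
  while (offset >= (long) sizeof (word_t))
    {
      word_t w;
      memcpy (&w, s, sizeof w);
      w |= ~(w >> 1);
      w &= REPEAT_BYTE (0x40);
      w >>= 6;
      w *= REPEAT_BYTE (1);
      offset -= (long) (w >> ((sizeof (word_t) - 1) * 8));
      s += sizeof (word_t);
    }

  /* Tail: finish the character we may be inside, then step the rest. */
  while (!is_start_byte (*s))
    s++;
  while (offset-- > 0)
    {
      s++;
      while (!is_start_byte (*s))
        s++;
    }
  return (const char *) s;
}
```

The result does not depend on where the string happens to start in memory; only the split between prologue and word loop does.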

¶ 11:41 AM

Comments:

very interesting experiments :)

I'd checked out and tried all patches, and I know that glib's implementation is *magical*.

BTW, I didn't understand your code yet ;)

# posted by

Anonymous : November 03, 2005 4:39 PM

Here is how my algorithm works. We have a word w, and we want to count the bytes in it whose top two bits are not 10. To do that, I OR w with a negated copy of itself shifted one bit to the right. Now the 7th bit of each byte (bit 6, counting from zero) is one if and only if the top two bits are not 10. So I mask the word with the byte 0x40 repeated, which leaves only the 7th bit of each byte.

All I want now is to count how many of those 7th bits are set. First shift right by 6, so the special bit becomes the lowest bit of each byte. Here goes the trick: if you multiply the word by a word with the byte 0x01 repeated (0x01010101 for 32-bit words), you get the desired count in the high byte. We are a shift away from the answer :).
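
The steps described here can be tried on a single word in isolation. This is a small sketch fixed at 32 bits, using stdint types rather than glib's; REPEAT32 and count_start_bytes32 are names made up for the illustration.

```c
#include <stdint.h>

/* The byte x repeated four times, e.g. REPEAT32(1) == 0x01010101. */
#define REPEAT32(x) ((uint32_t)(((uint32_t)-1 / 0xFF) * (x)))

/* Count the bytes of w whose top two bits are not 10,
 * i.e. the bytes that start a UTF-8 character. */
static unsigned
count_start_bytes32 (uint32_t w)
{
  w |= ~(w >> 1);         /* bit 6 of each byte becomes b6 | !b7 */
  w &= REPEAT32 (0x40);   /* keep only that bit */
  w >>= 6;                /* each byte is now 0 or 1 */
  w *= REPEAT32 (1);      /* high byte = sum of the four bytes */
  return w >> 24;
}
```

Taking the bytes 0x61 0xC3 0xA9 0x62 ("a", the two bytes of "é", "b") as the word 0x61C3A962: after the OR and the mask it is 0x40400040, after the shift 0x01010001, and the multiplication leaves 1 + 1 + 0 + 1 = 3 in the high byte. Byte order does not matter, since only the count is used.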

# posted by

behdad : November 03, 2005 6:00 PM

I should've checked the comments for an explanation first (I saw your post on planet gnome), that would have saved me 10 minutes ;)

You taught me two tricks: 1) the REPEAT() trick (very nice, I'm so going to steal that) and 2) the w |= ~(w>>1) trick to set a bit iff it and its more significant neighbour were not 10. Nice.

I think I know how to pull the multiplications out of the inner loop in a nice way -- something like this (just a rough sketch):

...
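while (offset >= sizeof(WORDTYPE)) {
  int iter;
  WORDTYPE sum = 0;

  /* at most 255 words per batch, so no byte of sum can overflow */
  iter = MIN(255, offset/sizeof(WORDTYPE));

  while (iter--) {
    register WORDTYPE w = *ws++;
    w |= ~(w>>1);
    w &= REPEAT(0x40);
    sum += w >> 6;  /* add 0 or 1 per byte; the raw 0x40 bits would overflow a byte on the fourth word */
  }
  /* fold the per-byte counts into offset */
  while (sum) {
    offset -= (sum & 0xFF);
    sum >>= 8;
  }
}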

...

Rationale:
Multiplications are quite a bit slower than shifts and additions + this avoids the data dependency in your loop: offset >= sizeof(WORDTYPE) depends on offset -= w which depends on all the stuff inside the loop. It is probably better to calculate a safe upper bound on the number of iterations in the inner loop in advance and then have that loop run at full tilt.
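
For anyone who wants to poke at the idea on its own, here is a self-contained sketch of just the batching part: it counts start bytes over an array of 32-bit words and folds the byte lanes once per batch instead of multiplying once per word. The function name and the use of stdint types are my own assumptions, not part of the patch. Note the shift by 6 before summing, so that each lane grows by at most 1 per word and a batch of 255 words cannot overflow it.

```c
#include <stddef.h>
#include <stdint.h>

#define REPEAT32(x) ((uint32_t)(((uint32_t)-1 / 0xFF) * (x)))

/* Count UTF-8 start bytes in n 32-bit words, without a multiply
 * in the inner loop. */
static size_t
count_start_bytes_batched (const uint32_t *words, size_t n)
{
  size_t total = 0;

  while (n > 0)
    {
      size_t batch = n < 255 ? n : 255;   /* lane sums stay <= 255 */
      uint32_t sum = 0;

      n -= batch;
      while (batch--)
        {
          uint32_t w = *words++;
          w |= ~(w >> 1);
          w &= REPEAT32 (0x40);
          sum += w >> 6;                  /* each lane grows by 0 or 1 */
        }

      /* Fold the four lanes into the running total. */
      while (sum)
        {
          total += sum & 0xFF;
          sum >>= 8;
        }
    }
  return total;
}
```

The inner loop has a trip count that is known before it starts, which is the point of the rationale above: nothing inside it feeds back into the loop condition.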

I'll probably have a go at coding it up... but don't feel you have to wait for me ;)

---

PS: Neither the code nor the pre tag is accepted -- does anybody know of an alternative for keeping code formatting unbungled?

# posted by

Anonymous : November 04, 2005 2:54 AM

Ah, well... the experiment didn't pan out as well as I'd hoped.

For one, gcc in its infinite wisdom, turned my inner while (iter--) {...} loop into something like for (tmp=0; tmp < iter; tmp++) { ... }, where tmp is register allocated and iter is not. Brilliant loop conversion on a register-starved machine :(

(yes, I know you can fomit the frame pointer)

# posted by

Anonymous : November 04, 2005 9:48 AM

Thanks Peter, nice observation.

# posted by

behdad : November 04, 2005 2:47 PM

Thanks.

It turns out the multiplication becomes a bunch of shifts + add + lea, most of which are slower on a P4 than on anything else. This could explain your measurements being different from hpj's (see http://hpj.blognaco.com/2005/11/03/or-is-it/).

On my slow 450 MHz PIII laptop, your two versions are slower than anything else, except for Spanish/Finnish/Danish where they are slightly faster than the glib version.

Add the twist I suggested above and you have a function that is significantly faster for the mostly latin-1 languages. It is faster than the "untwisted" behdad versions on zh_TW/ko/ar but the slowdown compared to glib is still too big. Add a shortcut if offset < n * sizeof(WORDTYPE) and you reduce the slowdown to something that I think might be sufferable -- without losing much of the speedup on es/da/fi or even el (with an n of 4-6).

Only tested with gcc 3.4 on ubuntu breezy badger on the 450 MHz machine (for now).

---
Oh, btw, the silly loop conversion that I struggled with earlier can be worked around by making iter an unsigned char instead of an int.

I'll probably send some code + some timing charts tomorrow...

# posted by

Anonymous : November 04, 2005 6:42 PM

Are you still playing with this? I tried the following snippet with your benchmark; on average it was an improvement. (athlon-xp 1900+, gcc 4.03, -O3 -march=athlon-xp)

Code follows, but indenting is all screwed:

#define john_g_utf8_skip(i) (gchar)(((i)&128) ? ((i)&64) ? ((i)&32) ? ((i)&16) ? ((i)&8) ? ((i)&4) ? ((i)&2) ? 1 \
                                                                                                          : 6 \
                                                                                             : 5 \
                                                                                 : 4 \
                                                                     : 3 \
                                                         : 2 \
                                             : 1 \
                                 : 1)

#define john_g_utf8_next_char(p) (char *)((p) + john_g_utf8_skip(*(guchar *)(p)))
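
Since the indenting makes the nested conditionals hard to follow, here is the same decision chain written out as a plain function. This is only a readable restatement of the macro above under a hypothetical name (utf8_skip_from_lead), with an unsigned char argument instead of glib types. It has not been run through the benchmark.

```c
/* Number of bytes to step from a byte with value c, decided by
 * testing the leading bits one at a time instead of a table lookup. */
static int
utf8_skip_from_lead (unsigned char c)
{
  if (!(c & 0x80)) return 1;   /* 0xxxxxxx: ASCII */
  if (!(c & 0x40)) return 1;   /* 10xxxxxx: continuation byte, step one */
  if (!(c & 0x20)) return 2;   /* 110xxxxx */
  if (!(c & 0x10)) return 3;   /* 1110xxxx */
  if (!(c & 0x08)) return 4;   /* 11110xxx */
  if (!(c & 0x04)) return 5;   /* 111110xx */
  if (!(c & 0x02)) return 6;   /* 1111110x */
  return 1;                    /* 0xFE, 0xFF: never valid, step one */
}
```

For ASCII text the first test already decides the answer, which may be why it does well on the mostly latin-1 benchmarks.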

# posted by

Anonymous : November 27, 2005 1:04 AM
