Behdad Esfahbod's daily notes on GNOME, Pango, Fedora, Persian Computing, Bob Dylan, and Dan Bern!

My Photo
Name:
Location: Toronto, Ontario, Canada

Ask Google.

Contact info
Google
Hacker Emblem Become a Friend of GNOME I Power Blogger
follow me on Twitter
Archives
July 2003
August 2003
October 2003
November 2003
December 2003
March 2004
April 2004
May 2004
July 2004
August 2004
September 2004
November 2004
March 2005
April 2005
May 2005
June 2005
July 2005
August 2005
September 2005
October 2005
November 2005
December 2005
January 2006
February 2006
March 2006
April 2006
May 2006
June 2006
July 2006
August 2006
September 2006
October 2006
November 2006
December 2006
January 2007
February 2007
March 2007
April 2007
May 2007
June 2007
July 2007
August 2007
September 2007
October 2007
November 2007
December 2007
January 2008
February 2008
March 2008
April 2008
May 2008
June 2008
July 2008
August 2008
October 2008
November 2008
December 2008
January 2009
March 2009
April 2009
May 2009
June 2009
July 2009
August 2009
November 2009
December 2009
March 2010
April 2010
May 2010
June 2010
July 2010
October 2010
November 2010
April 2011
May 2011
August 2011
September 2011
October 2011
November 2011
November 2012
June 2013
January 2014
Current Posts
McEs, A Hacker Life
Thursday, November 03, 2005
 False alarm on g_utf8_offset_to_pointer

My investigations suggest that the current code is indeed the best to have in glib. Imagary first:



#define WORDTYPE guint
#define REPEAT(x) (((WORDTYPE)-1 / 0xFF) * (x))

gchar*
behdad_utf8_offset_to_pointer (const gchar *str, glong offset)
{
const WORDTYPE *ws;
const gchar *s;;

ws = (const WORDTYPE *)str;
while (offset >= sizeof (WORDTYPE)) {
register WORDTYPE w = *ws++;
w |= ~(w >> 1);
w &= REPEAT(0x40);
w >>= 6;
w *= REPEAT(1);
w >>= (sizeof (WORDTYPE) - 1) * 8;
offset -= w;
}

s = (const gchar *)ws;
while ((*(const guchar*)s)>>6==2)
s++;
while (offset--)
s = g_utf8_next_char (s);

return s;
}
(Update: code updated to rip off one more operation)

Surprisingly, unwrapping the loops slowed things down for all implementations! Should have something to do with pipelining. Makes some weird kind of sense to me.

The code for the benchmark is attached to this post to mailing list.

Update: As Morten shows, when dealing with words, one should align access, or it dumps core on Sparc and other weird architectures.

Comments:
very interesting experiments :)

I'd checked out and tried all patches, and I know that glib's implementation is *magical*.

BTW, I didn't understand your code yet ;)
 
Here is how my algorithm works. We have a word w, we want to count the number of bytes in it that their top two bits is not 10. To do that, I OR w with negated copy of it shifted one bit to the right. So now, the 7th bit of each byte is one if and only if the top two bits are not 10. So I mask the word with 0x40 byte repeated, to get the 7th bit of each byte remaining.

All I now want is to count the number of those 7th bits set. First shift 6 to right, we have the special bit as the first bit now. Here goes the trick, if you multiply the word by a word with 0x01 byte repeated (0x01010101 for 32-bit words), you get the desired count in the high byte. We are a shift away from the answer :).
 
I should've checked the comments for an explanation first (I saw your post on planet gnome), that would have saved me 10 minutes ;)

You taught me two tricks: 1) the REPEAT() trick (very nice, I'm so going to steal that) and 2) the w |= ~(w>>1) trick to set a bit iff it and its more significant neighbour were not 10. Nice.

I think I know how to pull the multiplications out of the inner loop in a nice way -- something like this (just a rough sketch):

...
while (offset >= sizeof(WORDTYPE)) {
int iter;
WORDTYPE sum = 0;

iter = min(255, offset/sizeof(WORDTYPE));

while (iter--) {
register WORDTYPE w = *ws++;
w |= ~(w>>1);
w &= REPEAT(0x40);
sum += w;
}
while (sum) {
offset -= (sum & 0xFF);
sum >>= 8;
}
}

...

Rationale:
Multiplications are quite a bit slower than shifts and additions + this avoids the data dependency in your loop: offset >= sizeof(WORDTYPE) depends on offset -= w which depends on all the stuff inside the loop. It is probably better to calculate a safe upper bound on the number of iterations in the inner loop in advance and then have that loop run at full tilt.

I'll probably have a go at coding it up... but don't feel you have to wait for me ;)

---

PS: Neither the code nor the pre tag is accepted -- does anybody know of an alternative for keeping code formating unbungled?
 
Ah, well... the experiment didn't pan out as well as I'd hoped.

For one, gcc in its infinite wisdom, turned my inner while (iter--) {...} loop into something like for (tmp=0; tmp < iter; tmp++) { ... }, where tmp is register allocated and iter is not. Brilliant loop conversion on a register-starved machine :(

(yes, I know you can fomit the frame pointer)
 
Thanks Peter, nice observation.
 
Thanks.

It turns out the multiplication becomes a bunch of shifts + add + lea, most of which are slower on a P4 than on anything else. This could explain your measurements being different from hpj's (see http://hpj.blognaco.com/2005/11/03/or-is-it/).

On my slow 450 MHz PIII laptop, your two versions are slower than anything else, except for Spanish/Finnish/Danish where they are slightly faster than the glib version.

Add the twist I suggested above and you have a function that is significantly faster for the mostly latin-1 languages. It is faster than the "untwisted" behdad versions on zh_TW/ko/ar but the slowdown compared to glib is still too big. Add a shortcut if offset < n * sizeof(WORDTYPE) and you reduce the slowdown to something that I think might be sufferable -- without losing much of the speedup on es/da/fi or even el (with an n of 4-6).

Only tested with gcc 3.4 on ubuntu breezy badger on the 450 MHz machine (for now).

---
Oh, btw, the silly loop conversion that I struggled with earlier can be worked around by making iter an unsigned char instead of an int.

I'll probably send some code + some timing charts tomorrow...
 
Are you still playing with this? I tried the following snippet whith your benchmark, on average it was an improvement. (athlon-xp 1900+, gcc 4.03, -O3 --march=atholn-xp)

Code follows, but indenting is all screwed:

#define john_g_utf8_skip(i) (gchar)((i&128) ? (i&64) ? (i&32) ? (i&16) ? (i&8) ? (i&4) ? (i&2) ? 1\
: 6\
: 5\
: 4\
: 3\
: 2\
: 1\
: 1)

#define john_g_utf8_next_char(p) (char *)((p) + john_one_g_utf8_skip(*(guchar *)(p)))
 
Post a Comment

Links to this post:

Create a Link



<< Archive
<< Home