McEs, A Hacker Life: UTF-8 Bit Manipulation

Wednesday, April 23, 2008

UTF-8 Bit Manipulation

Roozbeh:

My reasoning was that the current code is not worth changing without strong profiling data showing measurable gain in real-world use cases. That's all.

Now to your solution and questions. You approach has two and a half major issues that make it unusable in real code:

As you mention yourself, it uses 64-bit math,
It assumes that shifting an integer more than its width results in zero. That's undefined by the C language,
It does a "x*3". On many processors that's best implemented as "(x<<1)+x".

So while it theoretically works, using nine integer operations, in practice it's unusable. Oh, your function produces the exact same values as in the glib table BTW. That's good.

Here is my solution that can be written as valid C code using 13 simple 32-bit operations:

def behdad_utf8_skipper(c):
        v = c ^ 0xff
        r = (v > 0xF) << 2
        v >>= r
        s = (v > 0x3) << 1
        v >>= s
        r |= s
        r |= (v >> 1)
        return (0x11234561 >> (r << 2)) & 7

It's basically a negation followed by a log2 straight from bithacks, followed by a table lookup. I particularly like the beautiful final constant.

I leave it to others to measure if this is faster than the lookup-table in glib. Enjoyed working this out though. Everyone, go crazy, shove a few ops off!

Labels: bithacks, glib, utf8

¶ 9:17 PM

Comments:

Your second objection is valid (the undefinedness of too many shifts). Your first objection (to the 64-bit math) isn't as strong as you might think, since the 64-bit shift can often be done with only two 32-bit shift instructions (using the carry bit).

Your third objection, that x*3 should be written (x<<1)+x, is bogus. On almost all modern processors, the former is faster; on the few where it is not (and I can't think of any), gcc will automatically replace the multiply by shifts and adds.

# posted by

Joe : April 23, 2008 10:22 PM

Thanks Joe.

I'm not sure I understand your point about 64-bit shift. Isn't that only valid for shifting one bit? How do you do "i64>>j" using two 32-bit shift instructions?

Regarding "x*3", I wasn't suggesting that it should be written as "(x<<1)+x". All I was saying was that there are architectures on which gcc does this. So, when counting instructions, multiplications are not necessarily the same as addition, bitwise, or shift operations. That's all. Now when you say all modern processors, do you mean ARM too?

behdad

# posted by

behdad : April 23, 2008 11:59 PM

About Me

Twitter Updates