UTF-8 Bit Manipulation
Roozbeh:
My reasoning was that the current code is not worth changing without strong profiling data showing measurable gain in real-world use cases. That's all.
Now to your solution and questions. You approach has two and a half major issues that make it unusable in real code:
- As you mention yourself, it uses 64-bit math,
- It assumes that shifting an integer more than its width results in zero. That's undefined by the C language,
- It does a "x*3". On many processors that's best implemented as "(x<<1)+x".
So while it theoretically works, using nine integer operations, in practice it's unusable. Oh, your function produces the exact same values as in the glib table BTW. That's good.
Here is my solution that can be written as valid C code using 13 simple 32-bit operations:
def behdad_utf8_skipper(c):
v = c ^ 0xff
r = (v > 0xF) << 2
v >>= r
s = (v > 0x3) << 1
v >>= s
r |= s
r |= (v >> 1)
return (0x11234561 >> (r << 2)) & 7
It's basically a negation followed by a log2 straight from
bithacks, followed by a table lookup. I particularly like the beautiful final constant.
I leave it to others to measure if this is faster than the lookup-table in glib. Enjoyed working this out though. Everyone, go crazy, shove a few ops off!
Labels: bithacks, glib, utf8