Pango hacking revisited
First, some good news:
Lars Knoll sent
Owen Taylor and me a note about merging the OpenType Layout code in
Pango and Qt. So it seems like the last item on
my Pango roadmap will be RESOLVED FIXED without me doing all the work. Sweet.
The rest of this note is mostly in response to
Federico's blog posts on Pango. For the first time, I'm advertising a piece of code I wrote a few years ago for compressing data tables efficiently, so it may be of interest even to people who aren't otherwise interested in Pango.
In response to Federico's
Glyph lookups (aka. Profiling the file chooser, part 5), my only concerns are thread-safety, and scalability to multithreaded applications on multiprocessor systems. On thread-safety, I don't see any means of preventing race conditions when accessing the cache. Scalability is another well-known issue: my concern is that the cache may turn out to be a bottleneck when multiple threads try to access it in parallel. Discussion in the
GMemChunk bug looks relevant.
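To make the thread-safety point concrete, here is the kind of thing I have in mind (the structure and names are made up, not Federico's actual code): every lookup into a shared cache has to go through a lock, and that single lock is exactly what can become the bottleneck when several threads hit the cache at once.

    #include <glib.h>

    /* Hypothetical shared glyph cache; names are illustrative only. */
    typedef struct {
      GHashTable *map;    /* gunichar -> cached glyph data */
      GMutex     *lock;   /* serializes every access to the cache */
    } GlyphCache;

    static gpointer
    glyph_cache_lookup (GlyphCache *cache, gunichar ch)
    {
      gpointer glyph;

      /* Without the lock, concurrent readers and writers race on the
         hash table; with it, all threads contend on a single mutex. */
      g_mutex_lock (cache->lock);
      glyph = g_hash_table_lookup (cache->map, GUINT_TO_POINTER (ch));
      g_mutex_unlock (cache->lock);

      return glyph;
    }

Per-thread caches or read-mostly data structures avoid that contention, but then you pay in memory or in complexity.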
Now somehow I managed to miss the excellent work in Federico's next report, the
Pango gunichar->script and paired-character mappings work (aka. Profiling the file chooser, part 6). There are a lot of good points raised in that post, and I
damn wish I could make it to the
GNOME Summit (as a course in anger management, I really should remind myself more often that I hold an Iranian passport). Unfortunately Owen doesn't blog about his work that often, and in fact it's been a while since I last noticed him working on Pango. But I will try to find him and see about integrating some of this work in November. Now to the technicalities:
There are three main offenders that Federico, Billy, and Owen tackle: gunichar->script mapping, paired-characters table, and pango_log2vis_get_embedding_level. This last one is in fact
FriBidi code copied into Pango; I'll cover it last. The other two problems both involve looking up a property for a Unicode character, and the current code does it with a binary search on run-length encoded tables. That's indeed the naive approach, and it's slow.
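For reference, the slow path looks roughly like this (just a sketch, not the actual Pango code): the property is stored as sorted runs of characters sharing a value, and every single character pays for a binary search.

    #include <glib.h>

    /* A run-length encoded property table: sorted, non-overlapping runs. */
    typedef struct {
      gunichar start;   /* first character in the run */
      gunichar end;     /* last character in the run  */
      gint     value;   /* property value for the run */
    } PropertyRun;

    /* O(log n) per character; cheap in isolation, but it adds up when the
       lookup sits on a hot path that touches every character of the text. */
    static gint
    property_lookup (const PropertyRun *runs, gint n_runs,
                     gunichar ch, gint default_value)
    {
      gint lo = 0, hi = n_runs - 1;

      while (lo <= hi)
        {
          gint mid = (lo + hi) / 2;

          if (ch < runs[mid].start)
            hi = mid - 1;
          else if (ch > runs[mid].end)
            lo = mid + 1;
          else
            return runs[mid].value;
        }

      return default_value;
    }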
Now you may be amazed to know how these two relate to FriBidi too! In FriBidi, we currently have two Unicode data tables: one mapping Unicode characters to their bidi category, some 19 different values; the other mapping Unicode characters to their
mirrored characters, if any, which is basically an identity mapping for almost all characters, except for things like parentheses and brackets, which map to their pair.
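Just to illustrate what the mirroring data amounts to (a hand-picked sample, not the real table): only a few hundred characters have a mirror other than themselves, which is what makes this table so compressible.

    #include <glib.h>

    /* A few of the roughly three hundred mirrored pairs in Unicode;
       every other character simply mirrors to itself. */
    static const struct { gunichar ch, mirror; } mirror_pairs[] = {
      { '(',    ')'    }, { ')',    '('    },
      { '[',    ']'    }, { ']',    '['    },
      { '{',    '}'    }, { '}',    '{'    },
      { '<',    '>'    }, { '>',    '<'    },
      { 0x00AB, 0x00BB }, { 0x00BB, 0x00AB },  /* guillemets */
    };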
If these two kinda look like the two tables in Pango, you've got my point. So I'll elaborate on how we handled them in FriBidi:

Initially, when FriBidi was written in 1999, a bsearch table generated by a Perl script was used for both the bidi types and the mirrored characters. Then in 2000, Owen sent a patch to use a two-level lookup table, which drastically improved performance. At that time Unicode was still a 16-bit character set, so the two levels were a simple [256][256] split. Later on, when I showed up and started hacking all around FriBidi, Unicode jumped to the current 21-bit range, and all of a sudden there was no trivial best split. So I retired the Perl script and wrote a generic table compressor that, given the maximum number of lookups (levels) you can afford, finds the optimal breakdown of the table and generates C code for it. Later Owen suggested using indices instead of pointers, to reduce relocation time. That was done, and we had a shiny compressor that could squeeze the bidi-type data for the whole Unicode range into 24kb for a two-level table, 5kb for a four-level table, 2.5kb for a nine-level table, or anything in between! We left the mirroring code intact, still using the bsearch (that code is not used in Pango though; Pango uses the mirroring functions from GLib).

Later, when I was working on the new FriBidi code (not released yet), I improved the compressor further and decided to use it for the mirroring table too. Now, like any good compressor, it compresses well when the (information-theoretic) content of the table is small compared to the size of the table, which is the case for most Unicode properties: they take one of a few values, with some kind of locality effect. Mirroring is a bit different, since most characters map to themselves. What I did was create a new table holding the (mirrored_unichar - unichar) value. This new table is
really sparse. I applied the compressor and, lo and behold, it went down from 2656 bytes to 944 bytes, and from a bsearch over a 322-entry array to 4 lookups. Woot!
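To make the difference concrete, this is roughly the shape of what the compressor spits out for a two-level split (the array names, types, and the 256-way split here are illustrative; the generator picks the actual breakdown and emits the data): the binary search is replaced by two array indexings, and the stored value is the (mirrored_unichar - unichar) delta, which is zero for nearly every character.

    #include <glib.h>

    /* Emitted by the table compressor; the sizes and names here are only
       illustrative.  Identical 256-entry pages are shared, which is where
       the space saving comes from. */
    extern const guint8 mirror_level1[0x110000 / 256];  /* block -> page index */
    extern const gint16 mirror_level2[][256];           /* page  -> delta      */

    static gunichar
    mirror_char (gunichar ch)
    {
      /* Two lookups, no searching: level1 picks the page for the character's
         256-character block, level2 holds the signed delta for the character. */
      gint16 delta = mirror_level2[mirror_level1[ch >> 8]][ch & 0xff];

      return ch + delta;
    }

A four-level or nine-level table just chains more of the same indexing, trading extra lookups for smaller data; that trade-off is exactly the knob the compressor exposes.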
Now you may ask, what's the catch? The catch is that when I wrote that compressor code back in 2001, I was just moving on from programming contests to Free Software hacking, and using one-letter variable names in a backtracking search still seemed like a good idea. The code works amazingly well though, and I do actually go take a look at it once in a while to internalize how
not to write code. I have improved since.
The stuff:
Ok. I think the two samples are simple enough to get anybody going with the compressor. It would be interesting to see how tables generated with it perform in Pango. Other than that, the existence of this problem in the first place is a good reason why
I believe that all of the
Unicode Character Database properties should be efficiently exposed in a central library, part of
Project Giulia.
One final paragraph about pango_log2vis_get_embedding_level performance. There's been
some discussion about short-circuiting that function for LTR-only text, and that's on the horizon, but probably only after syncing the mini-fribidi code with upstream. That said, some of the performance problems with that function will be solved after resyncing and enabling the trash-stack: currently it malloc()s about twice per word of text.
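For a rough idea of what that short-circuit could look like (a hypothetical sketch, not the patch under discussion, and the ranges are deliberately coarse rather than exhaustive): scan the text once, and if nothing can possibly trigger right-to-left behaviour, every embedding level is just the base level and the whole bidi resolution can be skipped.

    #include <glib.h>

    /* Hypothetical pre-check: returns TRUE when the UTF-8 text cannot contain
       any right-to-left character or bidi control, so every embedding level
       is known to be 0 without resolving anything. */
    static gboolean
    text_is_ltr_only (const gchar *text, gint length)
    {
      const gchar *p   = text;
      const gchar *end = text + length;

      while (p < end)
        {
          gunichar ch = g_utf8_get_char (p);

          if ((ch >= 0x0590 && ch <= 0x08FF) ||   /* Hebrew, Arabic, Syriac, ...      */
              ch == 0x200F ||                     /* right-to-left mark               */
              (ch >= 0x202A && ch <= 0x202E) ||   /* explicit embeddings/overrides    */
              (ch >= 0xFB1D && ch <= 0xFEFC) ||   /* Hebrew/Arabic presentation forms */
              ch >= 0x10000)                      /* be conservative beyond the BMP   */
            return FALSE;

          p = g_utf8_next_char (p);
        }

      return TRUE;
    }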