Funny how I read your post right after the Boston Summit, where I had in fact discussed both issues with Owen Taylor. Let's see:
First, increasing the cairo scaled font cache size: You saw a 230 KB process size increase when increasing the cache size from 256 to 2048. Well, here's a little secret I decided not to disclose until someone noticed it, and you kinda qualify now: Cairo doesn't free glyph renderings after uploading them to the X server! You definitely need them if you are using the image surface, but most processes don't. So, you've got the smart X server hashing and reusing glyph renderings, and cairo-using processes keeping a copy around for no good reason. Fix that and happily increase the cache size to 2048 without the 230 KB size increase!
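To make the idea concrete, here is a minimal sketch of "drop the client-side copy once the server has it". None of these names come from cairo: `cached_glyph`, `backend_kind`, `upload_to_server` and `glyph_prepare` are all hypothetical, and the upload is a stand-in for whatever request a real X backend would send.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry; not cairo's actual data structure. */
typedef struct {
    unsigned long  index;      /* glyph index in the font */
    unsigned char *pixels;     /* client-side rendering, NULL once dropped */
    size_t         size;       /* bytes held in pixels */
    int            on_server;  /* nonzero once the server has its own copy */
} cached_glyph;

typedef enum { BACKEND_IMAGE, BACKEND_XLIB } backend_kind;

/* Stand-in for the real upload; a real backend would send the bitmap here. */
static void upload_to_server(cached_glyph *g)
{
    assert(g->pixels != NULL);
    g->on_server = 1;
}

/* Make the glyph usable by the given backend and return the bytes freed.
 * The image backend composites from pixels directly, so it keeps them.
 * The X backend only needs the server-side copy, so after the upload the
 * local rendering is released. A glyph dropped this way would have to be
 * re-rendered if an image surface asked for it later. */
static size_t glyph_prepare(cached_glyph *g, backend_kind backend)
{
    size_t freed = 0;

    if (backend != BACKEND_XLIB)
        return 0;
    if (!g->on_server)
        upload_to_server(g);
    if (g->pixels) {
        freed = g->size;
        free(g->pixels);
        g->pixels = NULL;
        g->size = 0;
    }
    return freed;
}
```

With something along these lines, the per-process cost of a cached glyph on the X backend shrinks to its bookkeeping, which is why the bigger cache would no longer cost 230 KB.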
Next, time spent in HarfBuzz, and particularly recreating and enlarging buffers all the time. As my summit hacking project I took on optimizing HarfBuzz. In short, I did:
Make output buffer copying/swapping during GSUB processing lazy, such that if a lookup doesn't affect a glyph string, no glyph copying takes place.
Compile all of HarfBuzz as a single file, to let the compiler do more optimizations. This increased the HarfBuzz binary size from 100 KB to 150 KB. Compiling with -Os brings it down to 70 KB. It may be worth profiling with -Os too.
Last but not least, cache one HB_Buffer.
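The first item is the least obvious of the three, so here is a rough sketch of what a lazy output buffer can look like. This is not the HarfBuzz code: `lazy_buffer` and its functions are made-up names, the `copies` counter exists only to show the effect, and the "lookup" is a toy pair-to-ligature substitution.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer; names are illustrative, not HarfBuzz's API. */
typedef struct {
    unsigned *in;        /* current glyph string */
    unsigned *out;       /* scratch storage with the same capacity */
    size_t    len;       /* glyphs in `in` */
    size_t    pos;       /* read position in `in` */
    size_t    out_len;   /* glyphs emitted so far */
    int       separate;  /* 0: output is still just in[0..out_len) */
    size_t    copies;    /* glyph copies performed, for illustration */
} lazy_buffer;

static int lazy_buffer_init(lazy_buffer *b, const unsigned *glyphs, size_t n)
{
    size_t cap = n ? n : 1;   /* avoid zero-sized allocations */

    memset(b, 0, sizeof *b);
    b->in  = malloc(cap * sizeof *b->in);
    b->out = malloc(cap * sizeof *b->out);
    if (!b->in || !b->out) {
        free(b->in);
        free(b->out);
        return 0;
    }
    memcpy(b->in, glyphs, n * sizeof *glyphs);
    b->len = n;
    return 1;
}

static void lazy_buffer_free(lazy_buffer *b)
{
    free(b->in);
    free(b->out);
}

/* Pass the current glyph through unchanged; copy only if already diverged. */
static void next_glyph(lazy_buffer *b)
{
    if (b->separate) {
        b->out[b->out_len] = b->in[b->pos];
        b->copies++;
    }
    b->out_len++;
    b->pos++;
}

/* Replace n_in input glyphs with one glyph, e.g. a ligature. */
static void replace_glyphs(lazy_buffer *b, size_t n_in, unsigned glyph)
{
    assert(n_in >= 1 && b->pos + n_in <= b->len);
    if (!b->separate) {
        /* First real change: materialize the untouched prefix now. */
        memcpy(b->out, b->in, b->out_len * sizeof *b->in);
        b->copies += b->out_len;
        b->separate = 1;
    }
    b->out[b->out_len++] = glyph;
    b->pos += n_in;
}

/* End of a lookup: adopt the output only if it differs from the input. */
static void swap_buffers(lazy_buffer *b)
{
    if (b->separate) {
        unsigned *tmp = b->in;
        b->in = b->out;
        b->out = tmp;
        b->len = b->out_len;
    }
    b->pos = b->out_len = 0;
    b->separate = 0;
}

/* Toy lookup: replace every adjacent pair (first, second) with lig. */
static void apply_ligature(lazy_buffer *b, unsigned first, unsigned second,
                           unsigned lig)
{
    while (b->pos < b->len) {
        if (b->pos + 1 < b->len &&
            b->in[b->pos] == first && b->in[b->pos + 1] == second)
            replace_glyphs(b, 2, lig);
        else
            next_glyph(b);
    }
    swap_buffers(b);
}
```

A lookup that matches nothing walks the string and leaves `copies` at zero, which is exactly the case that dominates when a font has 100+ lookups and most of them don't apply to a given run. Keeping one such buffer around between layout calls, instead of allocating a fresh one each time, is the idea behind the third item.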
All in all, in my measurements, these three made repeated text layout 10 to 20 percent faster for 1) very long paragraphs, and 2) fonts with many, many lookups, like Nafees Nastaliq (more than 100). They made no difference for regular small text+font combinations.
Mandatory screenshot of Nafees Nastaliq after I added a workaround to tolerate a bug in the font's GDEF table. I'm surprised how well it actually worked with the synthesized GDEF Pango put together for it, but now it works perfectly (again):