Behdad Esfahbod's daily notes on GNOME, Pango, Fedora, Persian Computing, Bob Dylan, and Dan Bern!

My Photo
Name:
Location: Toronto, Ontario, Canada

Ask Google.

Contact info
Google
Hacker Emblem Become a Friend of GNOME I Power Blogger
follow me on Twitter
Archives
July 2003
August 2003
October 2003
November 2003
December 2003
March 2004
April 2004
May 2004
July 2004
August 2004
September 2004
November 2004
March 2005
April 2005
May 2005
June 2005
July 2005
August 2005
September 2005
October 2005
November 2005
December 2005
January 2006
February 2006
March 2006
April 2006
May 2006
June 2006
July 2006
August 2006
September 2006
October 2006
November 2006
December 2006
January 2007
February 2007
March 2007
April 2007
May 2007
June 2007
July 2007
August 2007
September 2007
October 2007
November 2007
December 2007
January 2008
February 2008
March 2008
April 2008
May 2008
June 2008
July 2008
August 2008
October 2008
November 2008
December 2008
January 2009
March 2009
April 2009
May 2009
June 2009
July 2009
August 2009
November 2009
December 2009
March 2010
April 2010
May 2010
June 2010
July 2010
October 2010
November 2010
April 2011
May 2011
August 2011
September 2011
October 2011
November 2011
November 2012
June 2013
January 2014
Current Posts
McEs, A Hacker Life
Wednesday, December 10, 2008
 Improving Login Time, Part 3: bootcharting

Been a while since last instalment. So, having tightened up gnome-settings-daemon, I headed to get an overview of the entire login process, using the venerable bootchart. Before I jump in, bootchart is cool and everything, but if someone wants a smallish project to hack on, port the bootchart graphing tool from Java to pycairo. It's some 5k lines of code and should shrink considerably. Then we can make an interactive tool based on that and other pycairo scripts we have. I can see how that can turn into a profiler thingy in the long run...

For this assignment I updated my home machine to latest Fedora 10, created a new user, and bootcharted gnome-session in a warm login. I got the chart at right (click to enlarge). Lets see how it looks:This doesn't look right. Lets look at inside:

The CPU idle time

Let me quote parts of my first post about how gnome-session works:
To transition to the next phase, the current phase should either complete or time out. A phase is complete when all apps associated with the phase signal completion. An app can signal completion in a variety of ways, the simplest of which being that the process terminates. A phase times out after ten seconds. Description for the phases as well as a more a longer version of this condensed overview is available in gnome-session/gnome-session/README.
With that in mind, the two second idle looks a hell lot like a phase timeout. Inspecting ~/.xsession-errors and searching for gnome-session confirms that:
gnome-session: Application 'libcanberra-login-sound' failed to register before timeout
Oh oh! /usr/share/gnome/autostart/libcanberra-login-sound.desktop is where this application lives, and it contains these lines:
Exec=/usr/bin/canberra-gtk-play --id="desktop-login" --description="GNOME Login"
AutostartCondition=GNOME /desktop/gnome/sound/event_sounds
X-GNOME-Autostart-Phase=Desktop
That is, it runs canberra-gtk-play in the Desktop phase of gnome-session. So what's happening? gnome-session is waiting for canberra-gtk-play to finish playing login sound, how cool is that!? Not only gnome-session waits for it to finish playing, even when it does (after five seconds), gnome-session fails to notice that. So it waits another five seconds until the phase times out and it proceeds to the next phase. To verify this, note that in the graph, canberra-gtk-play starts at about 3.2 seconds in, and the CPU picks up activity again, at 13.2.

Why does it happen? Shouldn't gnome-session at least continue after the five-second playing is over? Well, yes according to the README document, no according to the code:
  if (manager->priv->phase == GSM_MANAGER_PHASE_INITIALIZATION) {
/* Applications from Initialization phase are considered
* registered when they exit normally. This is because
* they are expected to just do "something" and exit */
app_registered (app, manager);
}
So, only Initialization phase was handled that way. Not hard to fix. Bug filed, patch submitted. Then I also patched libcanberra to add a --background option to canberra-gtk-play. Reported and patch submitted. There is an alternate, much shorter, fix: move login sound to Application phase which is the last phase. Applications in the Application phase are not required to notify the session of their startup. That's a one line fix. The gnome-session fix is still a good thing to do. Anyway, with that fixed I got down from 18 to 13 seconds. Not bad!

The Mess

Lets see what's making the login take that long. The last batch of processes to start are the panel applets (clock-applet, wcnk-applet, trashapplet, gdm-user-switch, and notification-area) that start around 8 seconds in. The panel on the other hand starts at 3. What is the panel doing for those 5 seconds?! Time to switch tools and take a look inside panel. Federico looked at it in February, so I reused his patch and added more annotations as needed. The image at right is the first timeline view of what's happening inside gnome-panel. There are three large gaps there. The first one is the linker working hard. The second one is gnome_program_init(). I have absolutely no idea why those two are taking so long. Those two are very unstable though and go up and down in different runs. I will look into them later.

But the largest gap by far is the third one, which happens inside the first size_request(), and after I added enough annotations, it became clear that it's happening when pango is measuring fonts for the first time. Ouch! Checking the strace log, it becomes clear that fontconfig is opening three East Asian fonts. Why? It's not supposed to. Well, at least not if its cache is up to date. I then run fc-match and observe that it takes over a second. I try fc-cache and it errs about not being able to update cache for two East Asian font directories on my system. Humm, this is a first for me, although I confess I get reports about situations like this all the time... Reading more log, it appeared that opening cache files in ~/.fontconfig was failing as Permission denied! Weird stuff. Inspecting ~/.fontconfig revealed the reason: It had the following permissions: drw-rw-r--. Where did the x permissions go? This was a new user account, so the first process to call into fontconfig would have been gnome-settings-daemon. Oh my! In one of my patches from the last round, I had replaced a call to the glibc function daemon() with custom code that I lifted from preload, which I had lifted from some other daemon back in 2005. Anyway, seems like the daemonizing code there had a umask (0177); call in it, and no one's eyes ever caught the bug there. That mask is fine for files, but directories created under that are absolutely useless. And since Nov 25 we backported those changes into Fedora 10! Chaos! I had to go clean that up. Filed and fixed the bug in g-s-d first, and pushed fixed versions in F10 and rawhide. Then filed bug with fontconfig to fix it there too. Will look into that this week.

Next question was, why wasn't the system font cache updated anyway? Normally the users should not need to cache anything unless they have installed fonts in their home directory. I checked the RPM spec for the offending font packages. Dang, the spec files had a umask 0133 before calling fc-cache. Oops. Removed those and pushed updates out. However, that does *not* justify it as fc-cache doesn't need to create any directory when updating system-wide caches. That's still a mystery. Got to reinstall the broken RPMs to see what's going on.

Anyway, I then looked into gnome-settings-daemon to see why it was crashing. It was gstreamer crashing it. As I found last time, gstreamer initialization forks and tries to update the cache. The fork being there to not crash the parent process in case a plugin goes south. However, if updating the cache fails in the child process, it then retries from the parent. If that fails too, it just exits! The reason it wasn't updating the cache successfully? The same umask(0177). Now I remember, I can actually blame Vincent! When he was releasing GNOME 2.25.1. IRC log from November 5th:

vuntz: hrm, 2.25.1 doesn't look that good
vuntz: behdad: my friend!
vuntz: behdad: you're the one who touched main() in g-s-d
vuntz: behdad: g-s-d starts and quits after a while, so I have no theme, etc.
behdad: vuntz: huh?
behdad: vuntz: you mean the child dies?
vuntz: behdad: I don't know, I just don't have any theme
behdad: works fine here
behdad: do a ps
behdad: see if it's running
behdad: and look for g-s-d in .xsession-errors
vuntz: behdad: so, with gdb, I can see the theme applying after the fork()
vuntz: and disappearing on exit()
behdad: disappearing on exit?
vuntz: back to default theme when the parent exits
vuntz: so I guess the child exits too
***behdad wonders whether it has to do with broken pipes
vuntz: I'd guess so...
behdad: I did try to handle sigchld
behdad: ok, I need a definite answer to whether the child is running or not :)
vuntz: it does not run after the parent exits, at least
behdad: humm
behdad: does your .xsession-errors hint why?
vuntz: behdad: nothing
behdad: umm
vuntz: behdad: now I can blame you if I don't release 2.25.1 today, instead of blaming my headache :-)
behdad: humm
behdad: lemme see how to debug
behdad: does gdb tell why it exits?
vuntz: behdad: can I debug the child with gdb?
vuntz: oh, I guess I can just attach it
behdad: yeah
vuntz: interesting
vuntz: [gnome-settings-]
vuntz: that's what I have in ps after the fork
behdad: defuncts are zombies?
behdad: if you are still keeping parent live in gdb, that's expected after a child crash.
behdad: the question really is why it crashes.
vuntz: but why is the theme still applied?
behdad: vuntz: ok
behdad: run with --no-daemon?
mclasen: vuntz: gdb has a setting to follow the child on fork
vuntz: behdad: ah. It fails because gstreamer fails to init. Baah
behdad: bah
behdad: yeah, stupid gstreamer
behdad: forks and tries to initialize in child
behdad: then if that fails, tries to do the same in parent!
behdad: remove your gstreamer binary cache and recreate I guess
***behdad goes back to writing blog post
behdad: vuntz: you fixed your gsd?
vuntz: behdad: nah, but I'm sure it's a gstreamer issue now
behdad: ugh
behdad: ok, lemme know
behdad: I'm writing to gstreamer devel about their cache right now
vuntz: there's no cache for this user, fwiw
vuntz: so...

We totally had the bug in our hand and let it go. Looked so innocent for sure. Not very clear who let it go. Vincent says he'll sue me for posting this log anyway.

Later on I also got mail from a Fedora user saying that the umask thing also created problems for him with ORBit. I already reported to gstreamer devs, and will fix the fontconfig one myself. If anyone can look into ORBit, that would be great.

Lesson learned the hard way: In your library code, when you mkdir(), do a chmod() immediately. If you have not been doing it before, you may also want to consider trying a chmod if writing to the directory fails. umask issues have affected many many packages in Fedora before. Users install RPMs with limited umask and all kind of things break. better fix it in the libraries and programs, instead of distros adding umask to their specs and getting it wrong...

Oh, the timing. Fixing these, the huge gap in the panel shrank drastically, from 1.4 seconds to 0.13 seconds. 130ms is still a long time to spend in fontconfig. Seems like about 60% of that is just opening and reading/mmap'ing the config and cache files. That can be improved by reducing the number of files: packaging more font files in the same directory, and avoiding micro config files. Fortunately Nicolas is on the right track to fix these in Fedora. The other 40% is spent on the first FcFontSort() call. I have ideas about how to improve that, and will look into it in the coming days.

Do we need to call into fontconfig there? The panel needs to know the height of the default font to calculate the minimum panel height and enforce that. Sounds like a good idea, but knowing that the panel properties dialog does not let you resize it to below the minimum height, this check is redundant unless you managed to modify the gconf key directly or changed the font size while panel was not running. In other words, the exceptional case. If we can avoid that call, g-s-m can go to next phase some 130ms earlier.

There is another 90ms gap in the graph, and that is GTK+ finding the icon cache to use and loading the panel icon. Right, the panel has an icon! If I remove the gtk_window_set_default_icon_name() call the gap shrinks considerably, but then when you ctrl+alt+tab you won't see the cute gnome-panel icon. Donno, maybe we can set the icon later to avoid holding the entire session back for 90ms.

Lets step back and look though. fontconfig cache, gstreamer plugin cache, icon cache. We've got so many caches, which actually do wonders, but just finding and opening them still takes a lot of time. The gstreamer case is easy to fix since it has a single cache file in a single directory, it's just that it currently is doing unnecessary checks. The fontconfig and icon caches are harder to fix. The libraries have to look into multiple directories and read multiple config files to use the cache. It stinks. There used to be a short period of time that fontconfig generated one fat cache file for all the fonts. That's a terrible idea when you need to regenerate the cache every time a new font is installed. But what if we had a "git gc"-like mode to merge multiple cache files into one? New font dirs will go into new cache files, until next time, either manually or automatically, fc-cache --compact is invoked... I'll think about it.

I said "holding the entire session back" twice in previous paragraphs. This was not actually happening: gnome-panel was registering with the session as first thing in its main() (it's implied in the gnome_program_init() call), and the session was taking that as "I'm ready, go to next phase". Which means that Nautilus may start up and draw icons before the panel has had a chance to set its struts. Net result is that Nautilus icons will jump down when the panel gets to set up the struts. So I wrote patch for that. But that didn't fix the problem, another bug was in the play. Two hours of debugging later, I fixed that too.

The rest

With this mess cleaned, login time is back down to a sane 5.5s. gnome-panel is not taking much CPU time anymore. Nautilus some. Other than those, two things stand out: python and sealert. sealert is an applet to "View & interpret SELinux denials". It's written in python. Worse, when starting, it calls over D-BUS to start a server instance of itself, in another python process, and the two wait for SELinux denials to happen so they can report it to you using notification-daemon bubbles... for you to click and dismiss. Srsly, what the heck is Aunt Tillie to do with them? Forget Aunt Tillie, what is even a tech-savvy user to do with them? We were at the Fedora 10 release party in Seneca a couple weeks ago and a user asked me what these bubbles are. I explained to him, and he said "I just close them because they happen all the time". Which is true. It's great if you are debugging why your HTTP server does not serve your files, but is it good for anything else? Certainly not enough to be enabled by default. Please, disable sealert by default and help save the planet. In the mean time, I also proposed adding an additional phase to gnome-session which would run after the Application phase and is for starting processes that do not render to the desktop immediately, because those can start at the end, to give a better perception of desktop having loaded completely. sealert technically falls into that category. In practice however, it pops up a bubble before the login is done. In many setups at least.

I think the same argument also applies to the kerneloops-applet. Maybe enable these in Rawhide so people report bugs, but disable them in stable release. I'm pretty sure that's the right thing to do. kerneloops-applet is quite light on the CPU though. Removing sealert, kerneloops-applet, and some other lightweight tasks that I don't need, I get down to 4.5 seconds. The biggest CPU consumer now is Nautilus, and it's taking three seconds to settle down. Federico looked into it before, and I will too at some point. But I first want to fix fontconfig. The ROI on that is huge: every application startup can become something between 50ms to 100ms faster if I my ideas work out.

Big picture

One wise Federico once said "Set concrete goals", although people now attribute that wisdom to Arjan these days. Regardless, let's set a concrete goal: I want to make Fedora default install, new user, warm login in three second. That may not happen next year, but that's where we need to be if we ever want to boot, say, in 15 seconds. Why optimize warm login when cold is so much worse? Because to make cold login go under 5 seconds, one first has to make warm do that. My other argument is that ideally people suspend their computer instead of shutting down, so first boot is the wrong thing to optimize. Warm login on the other hand is still common in multi-user systems. Moreover, my hope is that a 3 second warm login plus opportunistic preload'ing at login screen will give us a first login of about 5 seconds. Long road to there; we will see. By the way, 3.5 years later, Federico's GUADEC keynote is still pure gold.

How would we get there? By making fontconfig not suck 100ms for each application. By fixing Nautilus, whatever it is doing that it shouldn't. By getting rid of libgnome dependencies. By figuring out why linking is so slow. And by not running sealert and other unnecessary stuff.

To keep things in perspective, lets have a quick look at cold login to see what's there to optimize. The image at right shows a cold login of the same desktop I used above. Lots of I/O, and 14 seconds login time. The CPU goes way down for extend periods of time, when panel, Nautilus, pulseaudio, and python are all doing heavy I/O. No idea what they are reading, but looks like something that should be possible to fix.

This next one is with preload running. Down to 12 seconds. Not much really. Looks like it failed to preload anything that those four I/O-heavy processes use. No wonder, because preload can only preload stuff that are mmap()ed. That mostly translates to shared libraries. A more advanced prefetching solution, like Harold's readahead_collector that monitors all I/O will be able to solve this. Oh, Harold by the way blogged recently about using systemtap to measure boot-time I/O. He's actually looking at me to fix gconf I/O issues, and that's exactly what I'm going to do with his script when I get to that stage.

If you read this far, I owe you your beverage of choice. Find me at GUADEC+Akademy!

Labels: , , , ,

Comments:
Great work and great blog post.

You deserve an applause.
 
I'm not a developer but these optimization articles are still very interesting to read. Keep up the good work!
 
Simply great!

Can wait for package in F10 update...
Any tips ?
 
glad to hear you're working on tightening gnome startup. I'd love to be running a normal gnome session on my testboxes, which is basically a matter of startup time and reliable startup with recently-killed sessions.
 
about disabling kernel-oops, I think that is a bad idea.
Changes on the kernel comes and goes.
Strange hardware comes and goes.
And many users never tuches rawhide, but yet uses the hardware.

That said there could be a lot things to do to make it generate wake-ups, I am sure. But the input on the sometimes somewhat obscure but still valid uses of linux that can trigger strange oopses can really make a diffrense if they are submitted and not only just left in the dmesg to rot.
 
I love every single one of these posts. They teach a lot about the process of analysis and optimization. I've been using them to train members of my team at work!
 
awesome work! can't wait for this to hit updates and mainline!
 
no, i am pretty sure that we owe you your beverage of choice. is there a gnome beer fund we can donate to?
 
Very nice posts :)! Thanks for writing this stuff up, it's fascinating.
 
I'm quite puzzled how fontconfig can be taking ~100 ms on your system Behdad? It is something like 6 ms here for a program that only does FcInit();.

I guess that you've got more fonts installed than I do, but enough to explain such a big difference?

$ fc-list | wc -l
116
$ fc-cat | wc -l
535

Also, as far as I can tell it only took ~40 ms in Part 2 of your writings.
 
Really interesting article. Cheers Behdad. You've got it the wrong way around... we all owe you a drink.
 
Absolutely fascinating; love the detail! It must take almost as much time -- if not more -- to write up the blow-by-blow account than to actually do the work. Both are incredibly valuable!
 
Very interesting. As a user it's nice to see some of the detail of people investigating and improving my experience.

I have to say though that as a user I greatly appreciate the functionality sealert brings to the table.

It's much nicer than working through SELinux denials manually. Perhaps Aunt Tilley isn't the sort of person who runs a webserver or database on their machine but I imagine a lot of Fedora users are that sort of person.

To that extent I think having something nice that suggests good policy is valuable and might help avoid people making bad choices or turning SElinux off completely.

On the other hand I don't imagine it's something that would be useful all that often other than when configuring a system. So perhaps something a bit "smaller" could be listening for alerts and then launch an app to handle them.
 
Maybe this bugs could be related with g-s-d crashes reported by vuntz on IRC:
https://qa.mandriva.com/show_bug.cgi?id=43783
http://bugs.gentoo.org/show_bug.cgi?id=239293

?

Thanks a lot for your help! :-D

Pacho Ramos
 
Good job!!

Cheers!

Harald Hoyer
 
I was amazed how much I enjoyed reading an article of which a good 90% I didn't understand :)

Great work!
 
I completely agree with Jimmac's last comment here.
This is the kind of postings that I love Planet GNOME for. One day when I've grown up I will become a hacker too! Double true! Or maybe next life. :-P
 
I have a stupid question: how can you bootchat the user session? I know how to install bootchart, but I have no clue how to use it for the user session. I would be curious to try it on my computers. Thanks if you have some advices!
 
Thank you Behdad. That was another great story. Although I don't understand everything it still is really cool to see the direct link between the analysis of unwanted behaviour, a rapid solution and an immediately improved user experience. This peek "under the hood" really does it for me. So thanks again and keep up the great work. I look forward to part 4 :)
 
Interesting post. I was wondering if fontconfig can't load the fonts on demand?
 
I love your writing style. You provide a real insight in to the activities of gnome developers. Many thanks for the fascinating read!
 
the gnome world needs more optimization heroes like you. You are doing an absolutely fantastic job. Deepest thanks for keeping the "3 secs gnome login time" ideal alive, your efforts are VERY appreciated.

-- A gnome user
 
behdad I want to have your children!
 
I'm also not a developer and this is still fascinating. Keep up the great work! My desktop thanks you.
 
You are great! This degree of professionalism ensures a great product.
 
Like Rob said, it's the other way around: we owe you your beverage of choice.
 
Great job!

I have a lot of applets on my desktop, try adding all applets to the panel and see how long that takes - it takes ages!
 
Great work Behdad. It's good to see such a thorough investigation into improving the GNOME user experience.
 
Nice work. Although Fedora also has compiz enabled by default, and it churns for another 5 seconds doing apparently nothing too productive. So far I haven't seen any compiz startup optimization being done in the upstream (there is mostly just one developer).
 
Hi Behdad,

I started doing a PyGTK/PyCairo gui for bootchart. Here is the project. I hope to have it finished before 2009.

Best
Anders
 
http://code.google.com/p/pybootchartgui/
 
Totally agree about sealert. How do we go about getting it removed from the default Fedora installation?
 
my jaws dropped to this post! awesome job behdad.
 
Outstanding post, Behdad. I really enjoyed reading it and I benefited from it immensely. I applaud the amount of work you've put into this effort, which I believe is going to benefit a vast amount of free software users.

I really admire the effort you've put into these optimizations, but I admire as much or moreso the time & effort you've put into communicating what you've found with us as you've done here.
 
Thanks everyone for your kind words. I'm glad that people are finding these posts educational, as that's exactly why I spend the time to write them in this level of detail.

Harald Hoyer recently posted his work on Fedora 10 boot analysis too. Check it out.

@AUDU: The login sound bug should be already fixed in F10 updates.

@Simon: Two things: FcInit() takes some time, and then FcFontSort() takes more time. The two combined is what I'm seeing here. But you have a fair point. The times look unrealistically high. I looked into it and seems like in this scenario the strace overhead is very nonneglibigle. Without strace it takes 15ms to init, and 40ms to init+sort. With strace, it's 120ms. So yeah, I was a factor of three off there :(. Thanks a lot for pointing out.

@Francois: I modified /usr/share/xsessions/gnome.desktop to run a shell script that I wrote. The shell script starts bootchart and then runs /usr/bin/gnome-session. You also need a way to stop the bootchart. You can add something like ( sleep 30; /sbin/bootchartd stop & ) in your shell script.

@Anonymous: Fontconfig is not loading fonts. It's loading caches. It needs to load the caches to know what fonts you have installed...

@Anders: Great! I checked the code out. Looks good already. I look forward to using it for my next series of bootcharting!

@Luis: I think Matthias is already on his way to remove sealert from the Desktop spin.
 
Thankyou for such interesting posts and good work :-)
 
Apparently Anders Norgaard and Henning Niss implemented the bootchart graphing tool in Python:

http://code.google.com/p/pybootchartgui/

Good job guys!
 
Behdad groupies want before Behdad's magic old Fedora bootcharts next to a bootchart including Behdad wondrous work so they can marvel of Behdad's coolness for hours on end.

Outstanding job Behdad - an excellent read!

PS: Behdad FTW!
 
Post a Comment

Links to this post:

Create a Link



<< Archive
<< Home