Year-end Cleaning (ie. on warning options)
Benjamin Otte recently
blogged about compiler warning options. Now, to use or not to use
-Werror
is something Benjamin and I have wasted hours discussing, but that's not the point here.
The reason that I thought I blog and link to his post is that compiler warnings are very useful tools for writing good code. These days, when I run my C code, it typically works the first time. That's not as much a complement to me as it is to my compiler. That's because by the time I finish fixing all the warnings, all the trivial bugs are gone too. Any C programmer must have a large set of warning options he enables, and here is mine:
-fno-common
-Wall
-Wdeclaration-after-statement
-Wextra
-Wformat=2
-Winit-self
-Winline
-Wpacked
-Wp,-D_FORTIFY_SOURCE=2
-Wpointer-arith
-Wlarger-than-65500
-Wmissing-declarations
-Wmissing-format-attribute
-Wmissing-noreturn
-Wmissing-prototypes
-Wnested-externs
-Wold-style-definition
-Wredundant-decls
-Wsign-compare
-Wstrict-aliasing=2
-Wstrict-prototypes
-Wswitch-enum
-Wundef
-Wunreachable-code
-Wunsafe-loop-optimizations
-Wwrite-strings
These ones I've used in the past but have disabled for one reason or the other:
-Wbad-function-cast
-Werror-implicit-function-declaration
-Wfloat-equal
-Wmissing-field-initializers
-Wmissing-include-dirs
-Wswitch-default
There are two that stand out:
-fno-common
is not a warning flag strictly speaking, but it's as good as one.
-Wp,-D_FORTIFY_SOURCE=2
is not a warning flag either, but it enables some extra checks in glibc headers.
Benjamin's post explains each warning he uses and why it's a good idea. When in doubt,
man gcc
is your friend.
Now, the trick to using these is to always remember that these are warnings. Some will trigger in very legitimate code. That's why
-Werror
is evil. Warning flags are a very personal thing. I don't like cluttering code I maintain with them.
Anyway, I went ahead and fixed all warnings that Pango and Vte compilation generated. Now I can see if my new code generates a warning. Yay.
Speaking of vte, it has been seeing some new development. Last month I wrote a pangocairo rendering backend for it. Later I implemented support for
combining characters. That wasn't easy as the vte code has a one-char-per-cell model hardcoded all over the code. I retro-fitted more-than-one-char-per-cell on it vy using a new data type called
vteunistr. ChPe has been adding new API, including adding GObject properties. And we all have been ripping old code out. Here is the summary of changes for the past six weeks:
99 files changed, 6317 insertions(+), 14094 deletions(-)
Labels: gcc, vte, warnings
Improving Login Time, Part 3: bootcharting
Been a while since last instalment. So, having tightened up gnome-settings-daemon, I headed to get an overview of the entire login process, using the venerable
bootchart. Before I jump in, bootchart is cool and everything, but if someone wants a smallish project to hack on, port the bootchart graphing tool from Java to pycairo. It's some 5k lines of code and should shrink considerably. Then we can make an interactive tool based on that and other pycairo scripts we have. I can see how that can turn into a profiler thingy in the long run...
For this assignment I updated my home machine to latest Fedora 10, created a new user, and bootcharted gnome-session in a warm login. I got the chart at right (click to enlarge). Lets see how it looks:
- The CPU becomes idle at 18 seconds in. The little activity lingering after is the bootchart overhead. That's a long time to log in!
- Doh, the CPU becomes idle for a couple of seconds in the middle.
- There's intense CPU activity periods caused by a many processes, namely, gnome-settings-daemon, wnck-applet, clock-applet, gdm-user-switch, and less intense in gnome-panel, nautilus, notification-daemon, and sealert.
- gnome-settings-daemon is terminating!
This doesn't look right. Lets look at inside:
The CPU idle time
Let me quote parts of
my first post about how gnome-session works:
To transition to the next phase, the current phase should either complete or time out. A phase is complete when all apps associated with the phase signal completion. An app can signal completion in a variety of ways, the simplest of which being that the process terminates. A phase times out after ten seconds. Description for the phases as well as a more a longer version of this condensed overview is available in gnome-session/gnome-session/README.
With that in mind, the two second idle looks a hell lot like a phase timeout. Inspecting
~/.xsession-errors
and searching for gnome-session confirms that:
gnome-session: Application 'libcanberra-login-sound' failed to register before timeout
Oh oh!
/usr/share/gnome/autostart/libcanberra-login-sound.desktop
is where this
application lives, and it contains these lines:
Exec=/usr/bin/canberra-gtk-play --id="desktop-login" --description="GNOME Login"
AutostartCondition=GNOME /desktop/gnome/sound/event_sounds
X-GNOME-Autostart-Phase=Desktop
That is, it runs
canberra-gtk-play
in the Desktop phase of gnome-session. So what's happening? gnome-session is waiting for canberra-gtk-play to finish playing login sound, how cool is that!? Not only gnome-session waits for it to finish playing, even when it does (after five seconds), gnome-session fails to notice that. So it waits another five seconds until the phase times out and it proceeds to the next phase. To verify this, note that in the graph, canberra-gtk-play starts at about 3.2 seconds in, and the CPU picks up activity again, at 13.2.
Why does it happen? Shouldn't gnome-session at least continue after the five-second playing is over? Well, yes according to the README document, no according to the code:
if (manager->priv->phase == GSM_MANAGER_PHASE_INITIALIZATION) {
/* Applications from Initialization phase are considered
* registered when they exit normally. This is because
* they are expected to just do "something" and exit */
app_registered (app, manager);
}
So, only Initialization phase was handled that way. Not hard to fix. Bug
filed,
patch submitted. Then I also patched libcanberra to add a --background option to canberra-gtk-play.
Reported and
patch submitted. There is an alternate, much shorter, fix: move login sound to Application phase which is the last phase. Applications in the Application phase are not required to notify the session of their startup. That's a one line fix. The gnome-session fix is still a good thing to do. Anyway, with that fixed I got down from 18 to 13 seconds. Not bad!
The Mess
Lets see what's making the login take that long. The last batch of processes to start are the panel applets (clock-applet, wcnk-applet, trashapplet, gdm-user-switch, and notification-area) that start around 8 seconds in. The panel on the other hand starts at 3. What is the panel doing for those 5 seconds?! Time to switch tools and take a look inside panel. Federico
looked at it in February, so I reused his patch and added more annotations as needed. The image at right is the first timeline view of what's happening inside gnome-panel. There are three large gaps there. The first one is the linker working hard. The second one is gnome_program_init(). I have absolutely no idea why those two are taking so long. Those two are very unstable though and go up and down in different runs. I will look into them later.
But the largest gap by far is the third one, which happens inside the first size_request(), and after I added enough annotations, it became clear that it's happening when pango is measuring fonts for the first time. Ouch! Checking the
strace
log, it becomes clear that fontconfig is opening three East Asian fonts. Why? It's not supposed to. Well, at least not if its cache is up to date. I then run
fc-match
and observe that it takes over a second. I try
fc-cache
and it errs about not being able to update cache for two East Asian font directories on my system. Humm, this is a first for me, although I confess I get reports about situations like this all the time... Reading more log, it appeared that opening cache files in
~/.fontconfig
was failing as Permission denied! Weird stuff. Inspecting ~/.fontconfig revealed the reason: It had the following permissions:
drw-rw-r--
. Where did the
x
permissions go? This was a new user account, so the first process to call into fontconfig would have been gnome-settings-daemon. Oh my! In one of my patches from
the last round, I had replaced a call to the glibc function
daemon()
with custom code that I lifted from
preload, which I had lifted from some other daemon back in 2005. Anyway, seems like the daemonizing code there had a
umask (0177);
call in it, and no one's eyes ever caught the bug there. That mask is fine for files, but directories created under that are absolutely useless. And since Nov 25 we backported those changes into Fedora 10! Chaos! I had to go clean that up. Filed and fixed the
bug in g-s-d first, and pushed fixed versions in F10 and rawhide. Then filed
bug with fontconfig to fix it there too. Will look into that this week.
Next question was, why wasn't the system font cache updated anyway? Normally the users should not need to cache anything unless they have installed fonts in their home directory. I checked the RPM spec for the offending font packages. Dang, the spec files had a
umask 0133
before calling
fc-cache
. Oops. Removed those and pushed updates out. However, that does *not* justify it as fc-cache doesn't need to create any directory when updating system-wide caches. That's still a mystery. Got to reinstall the broken RPMs to see what's going on.
Anyway, I then looked into gnome-settings-daemon to see why it was crashing. It was gstreamer crashing it. As I
found last time, gstreamer initialization forks and tries to update the cache. The fork being there to not crash the parent process in case a plugin goes south. However, if updating the cache fails in the child process, it then retries from the parent. If that fails too, it just exits! The reason it wasn't updating the cache successfully? The same umask(0177). Now I remember, I can actually blame
Vincent! When he was releasing GNOME 2.25.1. IRC log from November 5th:
vuntz: hrm, 2.25.1 doesn't look that good
vuntz: behdad: my friend!
vuntz: behdad: you're the one who touched main() in g-s-d
vuntz: behdad: g-s-d starts and quits after a while, so I have no theme, etc.
behdad: vuntz: huh?
behdad: vuntz: you mean the child dies?
vuntz: behdad: I don't know, I just don't have any theme
behdad: works fine here
behdad: do a ps
behdad: see if it's running
behdad: and look for g-s-d in .xsession-errors
vuntz: behdad: so, with gdb, I can see the theme applying after the fork()
vuntz: and disappearing on exit()
behdad: disappearing on exit?
vuntz: back to default theme when the parent exits
vuntz: so I guess the child exits too
***behdad wonders whether it has to do with broken pipes
vuntz: I'd guess so...
behdad: I did try to handle sigchld
behdad: ok, I need a definite answer to whether the child is running or not :)
vuntz: it does not run after the parent exits, at least
behdad: humm
behdad: does your .xsession-errors hint why?
vuntz: behdad: nothing
behdad: umm
vuntz: behdad: now I can blame you if I don't release 2.25.1 today, instead of blaming my headache :-)
behdad: humm
behdad: lemme see how to debug
behdad: does gdb tell why it exits?
vuntz: behdad: can I debug the child with gdb?
vuntz: oh, I guess I can just attach it
behdad: yeah
vuntz: interesting
vuntz: [gnome-settings-]
vuntz: that's what I have in ps after the fork
behdad: defuncts are zombies?
behdad: if you are still keeping parent live in gdb, that's expected after a child crash.
behdad: the question really is why it crashes.
vuntz: but why is the theme still applied?
behdad: vuntz: ok
behdad: run with --no-daemon?
mclasen: vuntz: gdb has a setting to follow the child on fork
vuntz: behdad: ah. It fails because gstreamer fails to init. Baah
behdad: bah
behdad: yeah, stupid gstreamer
behdad: forks and tries to initialize in child
behdad: then if that fails, tries to do the same in parent!
behdad: remove your gstreamer binary cache and recreate I guess
***behdad goes back to writing blog post
behdad: vuntz: you fixed your gsd?
vuntz: behdad: nah, but I'm sure it's a gstreamer issue now
behdad: ugh
behdad: ok, lemme know
behdad: I'm writing to gstreamer devel about their cache right now
vuntz: there's no cache for this user, fwiw
vuntz: so...
We totally had the bug in our hand and let it go. Looked so innocent for sure. Not very clear who let it go. Vincent says he'll sue me for posting this log anyway.
Later on I also got mail from a Fedora user saying that the umask thing also created problems for him with ORBit. I already
reported to gstreamer devs, and will fix the fontconfig one myself. If anyone can look into ORBit, that would be great.
Lesson learned the hard way: In your library code, when you mkdir(), do a chmod() immediately. If you have not been doing it before, you may also want to consider trying a chmod if writing to the directory fails. umask issues have affected many many packages in Fedora before. Users install RPMs with limited umask and all kind of things break. better fix it in the libraries and programs, instead of distros adding umask to their specs and getting it wrong...
Oh, the timing. Fixing these, the huge gap in the panel shrank drastically, from 1.4 seconds to 0.13 seconds. 130ms is still a long time to spend in fontconfig. Seems like about 60% of that is just opening and reading/mmap'ing the config and cache files. That can be improved by reducing the number of files: packaging more font files in the same directory, and avoiding micro config files. Fortunately Nicolas is on the right track to fix these in Fedora. The other 40% is spent on the first FcFontSort() call. I have ideas about how to improve that, and will look into it in the coming days.
Do we need to call into fontconfig there? The panel needs to know the height of the default font to calculate the minimum panel height and enforce that. Sounds like a good idea, but knowing that the panel properties dialog does not let you resize it to below the minimum height, this check is redundant unless you managed to modify the gconf key directly or changed the font size while panel was not running. In other words, the exceptional case. If we can avoid that call, g-s-m can go to next phase some 130ms earlier.
There is another 90ms gap in the graph, and that is GTK+ finding the icon cache to use and loading the panel icon. Right, the panel has an icon! If I remove the
gtk_window_set_default_icon_name()
call the gap shrinks considerably, but then when you ctrl+alt+tab you won't see the cute gnome-panel icon. Donno, maybe we can set the icon later to avoid holding the entire session back for 90ms.
Lets step back and look though. fontconfig cache, gstreamer plugin cache, icon cache. We've got so many caches, which actually do wonders, but just finding and opening them still takes a lot of time. The gstreamer case is easy to fix since it has a single cache file in a single directory, it's just that it currently is doing unnecessary checks. The fontconfig and icon caches are harder to fix. The libraries have to look into multiple directories and read multiple config files to use the cache. It stinks. There used to be a short period of time that fontconfig generated one fat cache file for all the fonts. That's a terrible idea when you need to regenerate the cache every time a new font is installed. But what if we had a "git gc"-like mode to merge multiple cache files into one? New font dirs will go into new cache files, until next time, either manually or automatically,
fc-cache --compact
is invoked... I'll think about it.
I said "holding the entire session back" twice in previous paragraphs. This was not actually happening: gnome-panel was registering with the session as first thing in its main() (it's implied in the gnome_program_init() call), and the session was taking that as "I'm ready, go to next phase". Which means that Nautilus may start up and draw icons before the panel has had a chance to set its struts. Net result is that Nautilus icons will jump down when the panel gets to set up the struts. So I
wrote patch for that. But that didn't fix the problem, another bug was in the play. Two hours of debugging later, I
fixed that too.
The rest
With this mess cleaned, login time is back down to a sane 5.5s. gnome-panel is not taking much CPU time anymore. Nautilus some. Other than those, two things stand out:
python
and
sealert
. sealert is an applet to "View & interpret SELinux denials". It's written in python. Worse, when starting, it calls over D-BUS to start a server instance of itself, in another python process, and the two wait for SELinux denials to happen so they can report it to you using notification-daemon bubbles... for you to click and dismiss. Srsly, what the heck is Aunt Tillie to do with them? Forget Aunt Tillie, what is even a tech-savvy user to do with them? We were at the Fedora 10
release party in Seneca a couple weeks ago and a user asked me what these bubbles are. I explained to him, and he said "I just close them because they happen all the time". Which is true. It's great if you are debugging why your HTTP server does not serve your files, but is it good for anything else? Certainly not enough to be enabled by default. Please, disable sealert by default and help save the planet. In the mean time, I also
proposed adding an additional phase to gnome-session which would run after the Application phase and is for starting processes that do not render to the desktop immediately, because those can start at the end, to give a better perception of desktop having loaded completely. sealert technically falls into that category. In practice however, it pops up a bubble before the login is done. In many setups at least.
I think the same argument also applies to the
kerneloops-applet
. Maybe enable these in Rawhide so people report bugs, but disable them in stable release. I'm pretty sure that's the right thing to do. kerneloops-applet is quite light on the CPU though. Removing sealert, kerneloops-applet, and some other lightweight tasks that I don't need, I get down to 4.5 seconds. The biggest CPU consumer now is Nautilus, and it's taking three seconds to settle down. Federico
looked into it before, and I will too at some point. But I first want to fix fontconfig. The
ROI on that is huge: every application startup can become something between 50ms to 100ms faster if I my ideas work out.
Big picture
One wise Federico once
said "Set concrete goals", although people now attribute that wisdom to
Arjan these days. Regardless, let's set a concrete goal: I want to make Fedora default install, new user, warm login in three second. That may not happen next year, but that's where we need to be if we ever want to boot, say, in 15 seconds. Why optimize warm login when cold is so much worse? Because to make cold login go under 5 seconds, one first has to make warm do that. My other argument is that ideally people suspend their computer instead of shutting down, so first boot is the wrong thing to optimize. Warm login on the other hand is still common in multi-user systems. Moreover, my hope is that a 3 second warm login plus opportunistic preload'ing at login screen will give us a first login of about 5 seconds. Long road to there; we will see. By the way, 3.5 years later, Federico's GUADEC
keynote is still pure gold.
How would we get there? By making fontconfig not suck 100ms for each application. By fixing Nautilus, whatever it is doing that it shouldn't. By getting rid of libgnome dependencies. By figuring out why linking is so slow. And by not running sealert and other unnecessary stuff.
To keep things in perspective, lets have a quick look at cold login to see what's there to optimize. The image at right shows a cold login of the same desktop I used above. Lots of I/O, and 14 seconds login time. The CPU goes way down for extend periods of time, when panel, Nautilus, pulseaudio, and python are all doing heavy I/O. No idea what they are reading, but looks like something that should be possible to fix.
This next one is with preload running. Down to 12 seconds. Not much really. Looks like it failed to preload anything that those four I/O-heavy processes use. No wonder, because preload can only preload stuff that are mmap()ed. That mostly translates to shared libraries. A more advanced prefetching solution, like Harold's readahead_collector that monitors all I/O will be able to solve this. Oh, Harold by the way
blogged recently about using
systemtap to measure boot-time I/O. He's actually looking at me to fix gconf I/O issues, and that's exactly what I'm going to do with his script when I get to that stage.
If you read this far, I owe you your beverage of choice. Find me at
GUADEC+Akademy!
Labels: gnome, gnome-panel, gnome-session, login, performance