Why I spent last summer rebooting Windows
How to compress several months of research into 50 lines of code and unblock a product launch, for fun and profit. Home parlor entertainment edition. Fun for all ages. Warranty void if seal broken. Offer not available in some states.
Now that others have found the same technique that I used to drop Chrome and Chrome Frame cold start times by more than 2/3 last summer, I feel compelled to explain what’s going on. Particularly as there seems to be some confusion over in the Hacker News thread.
First, a couple of notes and caveats. SSDs are coming (see: ChromeOS, iOS, etc.), and in an SSD world, none of what follows will matter. Some of it could even hurt you. Other concerns — drive-internal parallelism, bus bandwidth/latency, and the amounts of crapware (that’s you, Sophos/Bit9/etc.) between your driver and your
LoadLibrary call — are going to dominate. Also note that what I’m about to say is app specific. If you want your app to start faster, collect data on real-world systems and go from there. Cargo-cult hacks at your own peril.
Right then, spinning disks.
Good programs start fast, and that includes when you start them just after the OS has started. In this state, many of the caches the OS populates on your behalf are unlikely to be full of goodies that’ll help you start faster, so you pay full freight for all of your dependencies. This is called a “cold start”. When you start cold you have to go get everything you need from disk. Usually spinning disk. With > 10ms average latencies for random reads. Ouch.
For a sense of scale, consider that warm starts of Chrome (on Windows) take just ~200ms. That’s ~200ms for both the browser and renderer processes. 20 random reads from disk double that. Last summer, depending on OS version, disk speed, and amount of Prefetch/SuperFetch training cold starts ranged from 2000-3700ms. Fortunately, browsers are the sorts of things that users tend to keep open, and if they close, re-open later in the same session. Cold start hurts, but it’s not most user’s primary experience of a browser.
But that doesn’t hold for a plugin, particularly one that isn’t already ubiquitous. Yes, Flash loads quickly, but for most sites it starts fast because some other site had previously loaded it. Cold start matters even more for Chrome Frame. You’ve already started your browser, so why is this page taking so long to come in? It’s not enough that the network response is being buffered and the page will feel instant when the renderer finally starts…we need pixels on screen stat. Hence began my summer of rebooting. We couldn’t ship Chrome Frame to Stable until cold start was sub-second.
Using custom logging via ETW as a starting point, I quickly found that most of the time spent in Chrome startup wasn’t going to subprocess or thread creation, cache loading, profile reading, or any of the other obvious big-ticket items. Tremendous discipline has been enforced in the Chromium codebase, ensuring that things which might otherwise block startup tend to happen asynchronously, keeping your experience responsive.
What’s left? ETW, Sawbuck and XPerf eventually showed that we were thrashing disk for lots of little 16-64K reads into
chrome.dll. Why not 4K, the size of single page? The Windows memory manager uses a read-ahead optimization when a hard fault occurs. That is to say, when your program tries to execute some bit of code from a library that isn’t yet in memory, the memory manager will go get the bits that correspond to the code in question, plus a little. Compilers optimize for code locality so you’re probably going to need the other pages too. Read-ahead (probably) saves you another expensive seek. If you don’t use those pages, no biggie. Windows is smart enough to prioritize caches. Lots of little 16-64K reads (aka: random I/O!) matched up with the hard page fault data…bingo. Even with SuperFetch doing its thing on modern Windows versions, we were seeing lots of slow (>10ms on 5400rpm disk) reads thanks to hard page faults.
Why were we getting these faults? To be entirely honest, we never quite figured this out. SuperFetch/PreFetch do their thing before your program ever hits it’s CRT main (the very first program-supplied entry point), meaning frequently used pages should already be available. Part of the reason they weren’t might be the different patterns of access between the Chrome Browser and Renderer processes. The same binary (
chrome.exe) and main library (
chrome.dll) are used for both, but the bits that each use are pretty radically different. There’s no access to WebKit code from the Browser process, e.g., but that’s most of what the Renderer uses. Other theories included some cap on the number of pages that SuperFetch will remember and log. Whatever the reason, fault location graphs showed a pretty violent thrashing toward the top end of
Idea time: how can we pull all of the pages into the standby list — the bit of the Virtual Memory cache with the lowest priority — with a single seek, i.e. one big sequential read? What can we do to slurp the entire DLL into memory so that when the program needs some bit of the library, it gets it from memory, not from spinning (read “slow”) disk?
We tried loading the library and walking the pages sequentially. No joy…well, not much. Certainly not enough. More on that in a second. What about pulling the DLL into the disk cache? Hrmmm. A quick test showed that this worked great on Windows Vista. Just
fread the sucker in and boom. ~750ms cold starts!
Except on XP. Balls.
Seemingly the memory manager bypasses the disk cache for faults on XP. Back to the bat-DLL-memory-walk approach, Batman! Yes, this works on XP, albiet with some caveats — namely that it’s slower than the other method. We still get stalls and slow seeks, but things improve to the 1300ms range. Average cold starts cut in half. We’ll take it. The final patch isn’t pretty, but it does something like what we want. See this file, look for
PreReadImage which gets used in Chromium’s
WinMain which does custom DLL loading here, and only on the first (parent, aka, browser) process that’s created. Voila! Really, really fast starts nearly everywhere.
So, is doing this a good idea?
It’s not a slam dunk. For users with SSDs, this is more data than they strictly need, possibly slowing starts marginally for them. We’ll probably have to revisit this at some point. Also, the Vista+ approach potentially pollutes the disk caches. Not great, but at least the pages in questions won’t cause unnecessary thrashing unless the system is already under memory pressure in which case, the impending startup operations (allocating enough heap for V8, renderer process startup, etc.) are going to hurt no matter what. Also, we’re not exactly sipping data here. As the size of
chrome.dll grows, so will the time required to pull it across the bus, even if the reads are serial. On my most recent dev channel rev, that’s a 23MB DLL, so if we get 75-100MB/s across the bus (optimistic for these old drives), we’re looking at a couple of hundred milliseconds just spent reading. Beats a tons of random seeks, but it’s still not great. Ideally we’d be able to use PGO/WPO to scope down the amount we really need to read by clumping early-use stuff towards the front of the binary, but so far this hasn’t panned out thanks to the aforementioned multi-process/single-dll thing. Turns out binary re-writing on Windows isn’t trivial. Perhaps the biggest down-side is that the XP experience isn’t ideal yet, and users on XP are most likely to have neither SSDs nor particularly speedy spinning disks. They need the help most. We were able to do something for them thanks to some dark magic in the Windows XP Prefetch system, but that’s a tale for another day. Suffice to say I now have tremendous respect for the folks who built these systems. Some parts of Windows truly are beautiful.
All in all, this hack has panned out in the real world in a pretty compelling way. Our start time histograms show the improvement in the real world (thanks to all of you who turned on reporting!), and Chrome Frame was able to ship to Stable last fall after we were relatively certain that we weren’t degrading overall system performance for users.
If you’ve read all of this, I hope that for your sake it’s with a sense of historical bemusement. What? They didn’t all have SSDs/quantum-rent-an-exabytes? Pssh. How did they get by? Well, now you know. One slow reboot of Windows at a time.
Bootnote: I should mention that it wasn’t as straight line to suss out what actually happens in a cold start, pin down causality, understand the differences between XP and Vista+, and convince ourselves with real-world data that this was really our best option. It took a long time. Also, at some point, I stopped borrowing Amit’s copies of Windows Internals 4 and Windows Internals 5 and bought my own. You’ll need both if you still work with XP. Recommended.
Footnote to the Bootnote: Obvious Windows Hacker question: why not just do something like read in the DLL from the GCF BHO when you start up? Y’all get your BHO component loaded at IE launch, after all.
Good thought, Obvious Windows Hacker! Sadly, this is evil…or at least not very sporting. Even with low-priority I/O on Vista+, it’s sort of punitive to go do tons of I/O near when the user is probably starting to browse sites that might not make use of our BHO-based goodness, don’t you think? Besides, if we can get an honest startup win, why cheat? Honest wins pay off for both Chrome and Chrome Frame, which is doubly awesome.