Benchmarking Is Hard: Reddit Edition

June 24, 2009

In which I partially defend Microsoft and further lament the state of tech "journalism".

A very short open letter:

Dear interwebs:
Please stop mis-representing the results of benchmarks. Or, at a minimum, please stop blogging the results in snide language that shows your biases. It makes the scientific method sad.
Thank you.
Alex Russell

Today's example of failure made manifest comes via Reddit's programming section (easy target, I know), but deserves some special attention thanks to such witty repartee as:

Using slow-motion video? What a great idea. Maybe we can benchmark operating systems like that.

Maybe we can....and maybe we should. It might yield improvements in areas of OS performance that impact user experience. With a methodology that represents end-user perception, you should be able to calculate the impact of different scheduling algorithms on UI responsiveness, something that desktop Linux has struggled with.

The test under mockery may have problems, but they're not the ones the author assumes. It turns out that watching for visual indications of "doneness" is a better-than-average way to judge overall browser performance (assuming fixed hardware, testing from multiple network topologies, etc.). After all, perceived performance in browsing is all that matters. No one discounts a website's performance because when you visit they happen to let browsers cache resources that get used across pages or because they use a CDN to improve resource loading parallelism. In the real world, anything you can do to improve the perceived latency of a web site or application is a win.

MSFT's test methodology (pdf) does a good job in balancing several factors that affect latency for end-users, including resources that are loaded after onload or in sub-documents, potential DNS lookup timing issues, and the effects of network-level parallelism on page loading. Or at least it would in theory. The IE team's published methodology is silent on points such as how and where DNS caches may be in play and what was done to mitigate them, but the level of overall rigor is quite good.

So what's wrong with the MSFT test? Not much, except that they didn't publish their code or make the test rig available for new releases of browsers to be run against. As a result, the data is more likely to be incorrect because it's stale than to be incorrect due to methodology problems. New browser versions are being released all the time, rendering the conclusions from the Microsoft study already obsolete. Making the tests repeatable by opening up the test rig or filling in the gaps in the methodology would fix that issue while lending the tests the kind of credibility that the Sun Spider and V8 benchmarks now enjoy.

This stands in stark opposition to this latest "benchmark". Indeed, while the source code was posted, it only deepens my despair. By loading the "real world sites" from a local copy, much of the excellent work being done to improve browser performance at the network level is totally eliminated. Given the complexity of real-world sites and the number of resources loaded by say, Facebook.com, changes that eliminate the effects of the network make the tests highly suspect. While excoriating JavaScript benchmarks as not representing the real world accurately, the test author eliminated perhaps the largest contributor to page loading latency and perceived performance. Ugg.

Instead of testing real-world websites (where network topology and browser networking makes a difference), the author tested local, "dehydrated" versions of websites. The result is that "loading times" weren't tested, but rather a test of "local resource serving times and site-specific optimizations around the onload event" was run . Testing load times would have accounted for resources loaded after the onload event fired, too. There's reason to think that neither time to load from local disk nor time for a page to fire the onload handler dominate (or even indicate) real world performance.

I'm grateful that this test showed that Chrome loads and renders things quickly from local disk. I also have no doubt that Chrome loads real websites very quickly, but this test doesn't speak to that.

It's frustrating that the Reddits and Slashdots of the world have such poor collective memory and such faulty filtering that they can't seem to keep themselves from promoting these types of bias re-enforcing stories on a regular basis. Why, oh why, can't we have better tech journalism?