Benchmarking Is Hard: Reddit Edition

In which I partially defend Microsoft and further lament the state of tech “journalism”.

A very short open letter:

Dear interwebs:

Please stop misrepresenting the results of benchmarks. Or, at a minimum, please stop blogging the results in snide language that shows your biases. It makes the scientific method sad.

Thank you.

Alex Russell

Today’s example of failure made manifest comes via Reddit’s programming section (easy target, I know), but deserves some special attention thanks to such witty repartee as:

Using slow-motion video? What a great idea. Maybe we can benchmark operating systems like that.

Maybe we can… and maybe we should. It might yield improvements in areas of OS performance that affect user experience. With a methodology that represents end-user perception, you should be able to calculate the impact of different scheduling algorithms on UI responsiveness, something that desktop Linux has struggled with.

The test under mockery may have problems, but they’re not the ones the author assumes. It turns out that watching for visual indications of “doneness” is a better-than-average way to judge overall browser performance (assuming fixed hardware, testing from multiple network topologies, etc.). After all, perceived performance is all that matters in browsing. No one discounts a website’s performance because it lets browsers cache resources that are reused across pages, or because it uses a CDN to improve resource-loading parallelism. In the real world, anything you can do to improve the perceived latency of a web site or application is a win.

MSFT’s test methodology (pdf) does a good job of balancing several factors that affect latency for end-users, including resources that are loaded after onload or in sub-documents, potential DNS lookup timing issues, and the effects of network-level parallelism on page loading. Or at least it would in theory. The IE team’s published methodology is silent on points such as how and where DNS caches may be in play and what was done to mitigate them, but the level of overall rigor is quite good.

So what’s wrong with the MSFT test? Not much, except that they didn’t publish their code or make the test rig available for new releases of browsers to be run against. As a result, the data is more likely to be incorrect because it’s stale than to be incorrect due to methodology problems. New browser versions are being released all the time, rendering the conclusions from the Microsoft study already obsolete. Making the tests repeatable by opening up the test rig or filling in the gaps in the methodology would fix that issue while lending the tests the kind of credibility that the Sun Spider and V8 benchmarks now enjoy.

This stands in stark opposition to this latest “benchmark”. Indeed, while the source code was posted, it only deepens my despair. By loading the “real world sites” from a local copy, much of the excellent work being done to improve browser performance at the network level is eliminated entirely. Given the complexity of real-world sites and the number of resources loaded by, say, Facebook.com, changes that eliminate the effects of the network make the tests highly suspect. While excoriating JavaScript benchmarks for not representing the real world accurately, the test author eliminated perhaps the largest contributor to page-loading latency and perceived performance. Ugh.

Instead of testing real-world websites (where network topology and browser networking make a difference), the author tested local, “dehydrated” versions of websites. The result is that “loading times” weren’t tested; what was run was a test of “local resource serving times and site-specific optimizations around the onload event.” Testing load times would have accounted for resources loaded after the onload event fired, too. There’s reason to think that neither time to load from local disk nor time for a page to fire the onload handler dominates (or even indicates) real-world performance.
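
To make the distinction concrete, here is a rough sketch of the kind of instrumentation that would capture both numbers. It is illustrative only (not the benchmark’s actual code), and it assumes a modern browser with the Resource Timing API, which did not exist when the benchmark in question was written. It records when onload fires and when the last resource actually finished arriving, including resources requested after onload:

    // Illustrative only; assumes the Resource Timing API and addEventListener.
    window.addEventListener('load', function () {
      var onloadAt = performance.now(); // ms since navigation started
      // Wait a bit so late resources (post-onload scripts, images, XHRs) can land,
      // then find when the last resource finished downloading.
      setTimeout(function () {
        var entries = performance.getEntriesByType('resource');
        var lastFinish = 0;
        for (var i = 0; i < entries.length; i++) {
          lastFinish = Math.max(lastFinish, entries[i].responseEnd);
        }
        console.log('onload fired at ~' + Math.round(onloadAt) + 'ms');
        console.log('last resource finished at ~' + Math.round(lastFinish) + 'ms');
      }, 5000);
    });

On script-heavy sites those two numbers can diverge substantially, and that gap is exactly what an onload-only measurement hides.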

I’m grateful that this test showed that Chrome loads and renders things quickly from local disk. I also have no doubt that Chrome loads real websites very quickly, but this test doesn’t speak to that.

It’s frustrating that the Reddits and Slashdots of the world have such poor collective memory and such faulty filtering that they can’t seem to keep themselves from regularly promoting these bias-reinforcing stories. Why, oh why, can’t we have better tech journalism?

15 Comments

  1. Posted June 24, 2009 at 7:37 pm | Permalink

    Hello, thanks for the response.

    I will have to say, though, that slow-motion benchmarks of a browser are useless, because I can tell just by trying to interact with the page before onload fires that the whole browser is likely to be locked up. That is not a very useful standard for judging how fast a browser paints a portion of a page.

    As for the “local dehydrated versions” of the tests: the point of a test is to isolate the variables that would be likely to cause high variance.

    Your browser has nothing to do with the speed of your network or the server. Page loading times will often surpass DNS resolution times due to the fact that many sites you visit will be cached.

    I find that I can already interact with the page after the onload event has fired. Isn’t this good enough?

  2. Posted June 24, 2009 at 7:48 pm | Permalink

    Howdy:

    A couple of quick points:

    • Noting that things are “locked up” before onload says nothing about what might be causing a page to be unresponsive after onload. Your methodology doesn’t catch any of those issues.
    • Nearly all browsers implement a “progressive rendering” algorithm that causes the page to take longer to finally load and render, but provides an interactive UI in the interim. Your methodology doesn’t actually test to find out when, in the course of page loading, one can begin to meaningfully interact with the page (see the sketch just after this list), so making claims about when things are “locked up” or not doesn’t hold water.
    • One of the major improvements in Chrome’s initial release was an implementation of DNS pre-fetching. It has a large impact on real-world performance. Similarly, IE 8 and Chrome 2.0 both implemented concurrent script downloading, dramatically improving page loading performance for pages (like Facebook) that are script heavy and include many resources
    • Using local pages eliminates any potential differences in the effect of network-level request parallelism. For example, IE 7 (not tested) allows 2 network connections per host (via HTTP 1.1), whereas IE 8 bumps the limit to 8. This has large implications for page loading performance when using CDNs
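
    By way of illustration, here is a rough sketch of the sort of responsiveness probe I have in mind. It is hypothetical (not code from either test, and the names are made up): it samples the main thread on a timer, and a tick that arrives much later than scheduled means the UI thread was blocked and the user could not have interacted during that gap.

    // Hypothetical main-thread responsiveness probe; illustrative only.
    // (Uses addEventListener; older IE would need attachEvent instead.)
    var TICK = 50;                 // ms between samples
    var last = Date.now();
    var worstGap = 0;
    var probe = setInterval(function () {
      var now = Date.now();
      // How much later than scheduled did this tick arrive?
      worstGap = Math.max(worstGap, now - last - TICK);
      last = now;
    }, TICK);
    window.addEventListener('load', function () {
      // Keep sampling past onload to catch post-onload lockups, then report.
      setTimeout(function () {
        clearInterval(probe);
        console.log('worst main-thread stall: ~' + worstGap + 'ms');
      }, 3000);
    });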

    If the goal is to test the real world performance of browsers against pages as browsers load them, you should test that. If the goal is to test an isolated component of a browser (like the single-function benchmarks you derided), then you should make clear what parts of the browser you’re attempting to stress and eliminate sources of error.

    Regards

  3. Posted June 24, 2009 at 8:09 pm | Permalink

    -“Noting that things are “locked up” before onload says nothing about what might be causing a page to be unresponsive after onload. Your methodology doesn’t catch any of those issues.”

    I didn’t notice any instance where the page became unresponsive after onload.

    -“Your methodology doesn’t actually test to find out when, in the course of page loading, one can begin to meaningfully interact with the page, so making claims about when things are “locked up” or not doesn’t hold water”

    I used both DOMContentLoaded and onreadystatechange (for IE), and neither of them allowed for any meaningful interaction with the page (despite their descriptions). Therefore I decided to go with onload, because its standard definition describes an interactive page.

    -“One of the major improvements in Chrome’s initial release was an implementation of DNS pre-fetching. It has a large impact on real-world performance.”

    It only has a one-time impact. Once you’ve visited the page, your DNS cache is likely to hold the query for future reference. Assuming you are loading a new page, the page loading times are likely to be *far* higher than DNS. The DNS protocol is very lightweight compared to HTTP.

    -“Similarly, IE 8 and Chrome 2.0 both implemented concurrent script downloading, dramatically improving page loading performance for pages (like Facebook) that are script heavy and include many resources”

    I think all the modern browsers have concurrent connections by now.

    -“If the goal is to test an isolated component of a browser (like the single-function benchmarks you derided)”

    My goal was to create the most meaningful browser benchmark to date. Unfortunately, you cannot have a meaningful benchmark across a network when the network delays can be as big as the page loading times.

    If there was a way to standardize an internet connection, I would do it. But I really don’t think you would find a big difference even if you stored the pages on a LAN to emulate the internet.

  4. Posted June 24, 2009 at 8:24 pm | Permalink

    It’s worth reading the Chromium blog post about DNS prefetching:

    http://blog.chromium.org/2008/09/dns-prefetching-or-pre-resolving.html

    The graphs are in there for a reason.

    And then reading up on DNS TTLs:

    http://en.wikipedia.org/wiki/Time_to_live

    Regards

  5. Coheed
    Posted June 25, 2009 at 10:42 am | Permalink

    My only problem with Codexon’s benchmark was that it appears havenworks.com, a site that IE took forever to render, was included solely for the purpose of ensuring Firefox wasn’t dead last. Take it out of the benchmark and IE would have surpassed Firefox slightly in speed.

  6. Posted June 25, 2009 at 11:21 am | Permalink

    Coheed:

    I think it’s legit to include pathological cases in a benchmark…if things are bottlenecks, well, then they’re bottlenecks. What’s difficult about it is that no argument is made for why the site in question is representative of some broad class of pathological sites that hurt you in the real world. It might be, but who knows? That’s enough (to my mind) to eliminate it from a list which might have been otherwise populated by, say, the Alexa Top 100 sites or something.

    The biggest failings to my mind are around methodology. The site appears to be down for me, but IIRC, the times reported didn’t include a number of test runs, median and mean values, a discussion of how outliers were handled or diagnosed (discarded? included and reported as standard deviation?), or any real attempt to eliminate contention on local disk (since this was, after all, a test of local disk, which was being used both by the browsers for caching and storage and by the server or file-based I/O running the tests).
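
    For what it’s worth, the reporting side of that is cheap to do. A hypothetical helper (not anyone’s actual test code; the name “summarize” is made up) might look like this, given an array of per-run load times in milliseconds:

    // Hypothetical summary helper: keep the raw samples and report mean, median,
    // and standard deviation rather than a single number.
    function summarize(samples) {
      var sorted = samples.slice().sort(function (a, b) { return a - b; });
      var n = sorted.length;
      var mean = sorted.reduce(function (sum, x) { return sum + x; }, 0) / n;
      var median = (n % 2) ? sorted[(n - 1) / 2]
                           : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
      var variance = sorted.reduce(function (sum, x) {
        return sum + (x - mean) * (x - mean);
      }, 0) / n;
      return { runs: n, mean: mean, median: median,
               stddev: Math.sqrt(variance), raw: sorted };
    }

    With something like that in place, outliers stop being numbers you silently discard and become something you can show and explain.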

    Lastly, it didn’t test released versions of browsers and it didn’t discuss how error related to extensions might be eliminated or accounted for.

    Maybe what the world needs is a “how to produce and present credible benchmarks” document?

    Regards

  7. Posted June 25, 2009 at 12:20 pm | Permalink

    …and the page is back up. It does seem that they repeated the tests 10x each and reported the mean. Not great, but at least the tests were run multiple times.

  8. glop
    Posted June 25, 2009 at 1:10 pm | Permalink

    “the kind of credibility that the Sun Spider and V8 benchmarks now enjoy”

    Credibility that is not at all deserved, considering that these tests were created specifically to perform well in specific browsers, and are basically testing tiny parts of JS and are not relevant to overall performance at all, especially since JS only makes up a tiny part of even the JS-heaviest sites today.

  9. Posted June 25, 2009 at 1:20 pm | Permalink

    glop:

    I think it’s totally fair to argue over the relative merit of the V8 and Sun Spider benchmarks as they relate to a real-world web workload. What I meant by that statement wasn’t that they measured everything, but rather that they purport to measure specific things and do it well. They use a strong methodology, are reproducible, and don’t make claims about what they test that can’t be backed up by the tests themselves. In that sense, they’re much better than the test under discussion.

    Regards

  10. glop
    Posted June 25, 2009 at 1:34 pm | Permalink

    Maybe the actual test pages themselves make accurate claims, but the portrayal in the media and through Google and Apple’s marketing is that they somehow reflect overall browser performance rather than one of the least important aspects of the web today.

  11. Posted June 25, 2009 at 2:17 pm | Permalink

    glop:

    I can’t speak for – and won’t speak to – Apple’s marketing. What I can say is that if you look through the Chrome Blog (http://chrome.blogspot.com/), you’ll see that what’s claimed with respect to JS speed is only ever that. Does it matter in the real world? Depends on your app. Systems like GMail get a lot more out of the performance improvements in V8 than, say, the Google home page does. I’ll agree that the media tend to distort what’s claimed, to everyone’s great frustration, given the care with which the results are gathered and presented.

    You could uncharitably argue that this distortion is somehow a “win” for Chrome marketing, but I can assure you that inaccuracy hurts everyone, particularly since what matters at the end of the day to end-users is how fast the web feels *to them*. No one installs a browser and keeps using it based on a benchmark. Placebo effects wear off, so there’s no gain in a quick-hit PR win that isn’t backed up by your day-to-day experience of the browser.

    All of that speaks to why I praised the IE test methodology (while noting their stale results). Getting a reliable measure of end-user perceived performance is good for everyone, and I don’t think you’ll find anyone on the Chrome team arguing to the contrary. All this post is arguing is that the presented benchmark doesn’t even get close to that mark.

    Regards

  12. Posted June 25, 2009 at 6:37 pm | Permalink

    Hello Alex. It is undoubtedly true that DNS queries can be long.

    But it is a once-in-a-while cost.

    According to the Wikipedia article you linked, the common DNS TTL is 24 hours. DNS is just not that slow compared to everything else.

    Take OpenDNS for example:

    Cached Query
    dig @208.67.222.222 dojotoolkit.org
    Trial 1: Query time: 7 msec
    Trial 2: Query time: 8 msec

    Uncached Query
    dig @208.67.222.222 fkadsjkfjksadjfkjasdf.com
    Trial 1: Query time: 118 msec
    Trial 2: Query time: 7 msec

    As you can see, DNS query times quickly become negligible.

    I am willing to bet that most people visit the same few websites over and over again rather than 100 different websites that don’t have their DNS cached.

  13. Posted June 25, 2009 at 9:39 pm | Permalink

    It’s great that the top 25 sites are fast in IE8. But the web is changing.

    Sites using the canvas tag throw a hard punch to IE because it doesn’t directly support it.

    It seems to be the RIA apps and other leading-edge work that IE falls apart on. But I guess Microsoft’s solution there would be Silverlight, not IE8.

  14. lern_too_spel
    Posted June 26, 2009 at 10:00 am | Permalink

    There is actually something even more wrong with the benchmark. onload is underspecified, and many browsers fire it before the page is rendered. WebKit has an open bug in which it fires onload even before all the resources are loaded.

    On your dig against reddit: This point was actually mentioned in the benchmark post’s discussion but downvoted into oblivion by a troll. http://www.reddit.com/r/programming/comments/8vd9s/finally_a_browser_benchmark_that_tests_real/c0ak3au

  15. glop
    Posted June 29, 2009 at 11:33 am | Permalink

    Actually, the JS engines in Chrome and Safari have very little impact on even Gmail. There are other bottlenecks on the site that would have been much more noticeable if dealt with, but for some reason they chose to narrowly focus on JavaScript, which isn’t even close to being the main culprit even on Gmail.