Infrequently Noted

Alex Russell on browsers, standards, and the process of progress.

Comments for Benchmarking Is Hard: Reddit Edition


It's great that the top 25 sites are fast in IE8. But the web is changing.

Sites using the canvas tag hit IE hard because it doesn't support the element natively.

It seems to be the RIA apps and other leading-edge work that IE falls apart on. But I guess Microsoft's solution there would be Silverlight, not IE8.

Hello Alex. It is undoubtedly true that DNS queries can take a long time.

But it is a cost you pay only once in a while.

According to the Wikipedia article you linked, the common DNS TTL is 24 hours. DNS is just not that slow compared to everything else.

Take OpenDNS for example:

Cached query: dig @208.67.222.222 dojotoolkit.org
  Trial 1: Query time: 7 msec
  Trial 2: Query time: 8 msec

Uncached query: dig @208.67.222.222 fkadsjkfjksadjfkjasdf.com
  Trial 1: Query time: 118 msec
  Trial 2: Query time: 7 msec

As you can see, DNS query times become quickly negligible.
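
For illustration, here is a minimal sketch of the same cached-versus-uncached comparison in code (assuming Node.js with its built-in node:dns and node:perf_hooks modules; the hostnames mirror the dig trials above):

    // Times two back-to-back DNS resolutions per host, as the dig trials above do.
    // Node doesn't cache resolver results itself, so the second trial mostly
    // reflects the upstream resolver's cache (OpenDNS, in the example above).
    import { promises as dns } from "node:dns";
    import { performance } from "node:perf_hooks";

    async function timeResolve(hostname: string): Promise<number> {
      const start = performance.now();
      try {
        await dns.resolve4(hostname); // plain A-record query, like dig
      } catch {
        // NXDOMAIN for the made-up name still measures resolver latency
      }
      return performance.now() - start;
    }

    async function main(): Promise<void> {
      for (const host of ["dojotoolkit.org", "fkadsjkfjksadjfkjasdf.com"]) {
        const first = await timeResolve(host);
        const second = await timeResolve(host);
        console.log(`${host}: trial 1 ${first.toFixed(0)} msec, trial 2 ${second.toFixed(0)} msec`);
      }
    }

    main();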

I am willing to bet that most people visit the same few websites over and over again rather than 100 different websites that don't have their DNS cached.

by Codexon at
glop:

I can't speak for - and won't speak to - Apple's marketing. What I can say is that if you look through the Chrome Blog (http://chrome.blogspot.com/), you'll see that what's claimed with regard to JS speed is only ever that: JS speed. Does it matter in the real world? That depends on your app. Systems like GMail get a lot more out of the performance improvements in V8 than, say, the Google home page does. I'll agree that the media tend to distort what's claimed, to everyone's great frustration, given the care with which the results are gathered and presented.

You might uncharitably argue that this distortion is somehow a "win" for Chrome marketing, but I can assure you that inaccuracy hurts everyone, particularly since what matters to end-users at the end of the day is how fast the web feels to them. No one installs a browser and keeps using it based on a benchmark. Placebo effects wear off, so there's no gain in a quick-hit PR win that isn't backed up by your day-to-day experience of the browser.

All of that speaks to why I praised the IE test methodology (while noting their stale results). Getting a reliable measure of end-user perceived performance is good for everyone, and I don't think you'll find anyone on the Chrome team arguing to the contrary. All this post is arguing is that the presented benchmark doesn't even get close to that mark.

Regards

by alex at
My only problem with Codexon's benchmark was that it appears havenworks.com, a site that IE took forever to render, was included solely for the purpose of ensuring Firefox wasn't dead last. Take it out of the benchmark and IE would have surpassed Firefox slightly in speed.
by Coheed at
Coheed:

I think it's legit to include pathological cases in a benchmark...if things are bottlenecks, well, then they're bottlenecks. What's difficult about it is that no argument is made for why the site in question is representative of some broad class of pathological sites that hurt you in the real world. It might be, but who knows? That's enough (to my mind) to eliminate it from a list which might have been otherwise populated by, say, the Alexa Top 100 sites or something.

The biggest failings to my mind are around methodology. The site appears to be down for me, but IIRC, the reported times didn't include the number of test runs, median and mean values, a discussion of how outliers were handled or diagnosed (discarded? included and reported via standard deviation?), or any real attempt to eliminate contention on local disk (since the local disk was being used both by the browsers for caching and storage and by the server or file-based I/O that ran the tests, after all).

Lastly, it didn't test released versions of browsers and it didn't discuss how error related to extensions might be eliminated or accounted for.

Maybe what the world needs is a "how to produce and present credible benchmarks" document?
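
Such a document might, for example, suggest reporting along these lines (a rough TypeScript sketch; the summarize helper and the sample numbers are hypothetical):

    // Run a test N times and report mean, median, and standard deviation,
    // so readers can see how noisy the measurements are and spot outliers.
    function summarize(runsMs: number[]): { mean: number; median: number; stdDev: number } {
      const sorted = [...runsMs].sort((a, b) => a - b);
      const mean = runsMs.reduce((sum, v) => sum + v, 0) / runsMs.length;
      const mid = Math.floor(sorted.length / 2);
      const median =
        sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
      const variance =
        runsMs.reduce((sum, v) => sum + (v - mean) ** 2, 0) / runsMs.length;
      return { mean, median, stdDev: Math.sqrt(variance) };
    }

    // Ten invented load-time samples (in milliseconds) for one page;
    // the 1450 ms outlier would show up as a large standard deviation.
    console.log(summarize([812, 790, 805, 1450, 798, 801, 822, 795, 810, 799]));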

Regards

by alex at
Maybe the actual test pages themselves make accurate claims, but the portrayal in the media and through Google and Apple's marketing is that they somehow reflect overall browser performance rather than one of the least important aspects of the web today.
by glop at
Hello, thanks for the response.

I will have to say, though, that slow-motion benchmarking of a browser is useless: just by trying to interact with the page before onload fires, I can tell that the whole browser is likely locked up. It is not a very useful standard for seeing how fast a browser paints a portion of a page.

As for "local dehydrated versions of tests": the point of a test is to isolate the variables that would be likely to cause high variance.

Your browser has nothing to do with the speed of your network or the server. Page loading times will often far exceed DNS resolution times, since the DNS entries for many of the sites you visit will already be cached.

I find that I can already interact with the page after the onload event has fired. Isn't this good enough?

by Codexon at
-"Noting that things are “locked up” before onload says nothing about what might be causing a page to be unresponsive after onload. Your methodology doesn’t catch any of those issues."

I didn't notice any instance where the page became unresponsive after onload.

-"Your methodology doesn’t actually test to find out when (in the course of page loading), one can being to meaningfully interact with the page, so making claims about when things are “locked up” or not doesn’t hold water"

I used both DOMContentLoaded and onreadystatechange (for IE), and neither allowed any meaningful interaction with the page (despite their descriptions). Therefore I decided to go with onload, because it is a standard whose definition describes an interactive page.
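
For reference, a minimal sketch of the distinction being argued over here (assuming a browser environment; the script would need to run early in the page):

    // Logs when DOMContentLoaded and load fire, relative to when this script ran.
    const scriptStart = Date.now();

    document.addEventListener("DOMContentLoaded", () => {
      // The DOM has been parsed, but images, stylesheets, and iframes may
      // still be loading.
      console.log(`DOMContentLoaded after ${Date.now() - scriptStart} ms`);
    });

    window.addEventListener("load", () => {
      // Sub-resources have finished; this is the event the benchmark timed.
      console.log(`load (onload) after ${Date.now() - scriptStart} ms`);
    });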

-"One of the major improvements in Chrome’s initial release was an implementation of DNS pre-fetching. It has a large impact on real-world performance."

It only has a one-time impact. Once you've visited the page, your DNS cache is likely to hold the answer for future reference. Even assuming you are loading a brand-new page, page loading times are likely to be far higher than DNS resolution times. The DNS protocol is very lightweight compared to HTTP.

-"Similarly, IE 8 and Chrome 2.0 both implemented concurrent script downloading, dramatically improving page loading performance for pages (like Facebook) that are script heavy and include many resources"

I think all the modern browsers have concurrent connections by now.

-"If the goal is to test an isolated component of a browser (like the single-function benchmarks you derided)"

My goal was to create the most meaningful browser benchmark to date. Unfortunately, you cannot have a meaningful benchmark across a network when the network delays can be as big as the page loading times.

If there were a way to standardize an internet connection, I would do it. But I really don't think you would find a big difference even if you stored the pages on a LAN to emulate the internet.

by Codexon at
It's worth reading the Chromium blog post about DNS prefetching:

http://blog.chromium.org/2008/09/dns-prefetching-or-pre-resolving.html

The graphs are in there for a reason.

And then reading up on DNS TTLs:

http://en.wikipedia.org/wiki/Time_to_live
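
As an aside, pages can also ask for names to be pre-resolved explicitly. A minimal sketch (assuming a browser that honors the dns-prefetch link relation; the hostnames are placeholders):

    // Injects <link rel="dns-prefetch"> hints so the browser can resolve
    // third-party hostnames before they are actually requested.
    function prefetchDns(host: string): void {
      const link = document.createElement("link");
      link.rel = "dns-prefetch";
      link.href = "//" + host;
      document.head.appendChild(link);
    }

    // Placeholder hostnames for CDN origins the page will hit later.
    ["static.example-cdn.com", "images.example-cdn.com"].forEach(prefetchDns);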

Regards

by alex at
There is actually something even more wrong with the benchmark. onload is underspecified, and many browsers fire it before the page is rendered. Webkit has an open bug in which it fires onload even before all the resources are loaded.
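
A crude way to observe that gap, as a sketch (assuming a browser environment):

    // At onload, check whether every image on the page has actually finished.
    // If onload fired early, some images will still report complete === false.
    window.addEventListener("load", () => {
      const images = Array.from(document.images);
      const unfinished = images.filter((img) => !img.complete);
      console.log(`${unfinished.length} of ${images.length} images still loading at onload`);
    });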

On your dig against reddit: This point was actually mentioned in the benchmark post's discussion but downvoted into oblivion by a troll. http://www.reddit.com/r/programming/comments/8vd9s/finally_a_browser_benchmark_that_tests_real/c0ak3au

by lern_too_spel at
glop:

I think it's totally fair to argue over the relative merit of the V8 and Sun Spider benchmarks as they relate to a real-world web workload. What I meant by that statement wasn't that they measured everything, but rather that they purport to measure specific things and do it well. They use a strong methodology, are reproducible, and don't make claims about what they test that can't be backed up by the tests themselves. In that sense, they're much better than the test under discussion.

Regards

by alex at
Howdy:

A couple of quick points:

  • Noting that things are "locked up" before onload says nothing about what might be causing a page to be unresponsive after onload. Your methodology doesn't catch any of those issues.
  • Nearly all browsers implement a "progressive rendering" algorithm that causes the page to take longer to finally load and render, but provides an interactive UI in the interim. Your methodology doesn't actually test to find out when (in the course of page loading) one can begin to meaningfully interact with the page, so making claims about when things are "locked up" or not doesn't hold water.
  • One of the major improvements in Chrome's initial release was an implementation of DNS pre-fetching. It has a large impact on real-world performance. Similarly, IE 8 and Chrome 2.0 both implemented concurrent script downloading, dramatically improving page loading performance for pages (like Facebook) that are script heavy and include many resources.
  • Using local pages eliminates any potential differences in the effect of network-level request parallelism. For example, IE 7 (not tested) allows 2 network connections per host (via HTTP 1.1), whereas IE 8 bumps the limit to 8. This has large implications for page loading performance when using CDNs.

If the goal is to test the real world performance of browsers against pages as browsers load them, you should test that. If the goal is to test an isolated component of a browser (like the single-function benchmarks you derided), then you should make clear what parts of the browser you're attempting to stress and eliminate sources of error.

Regards

by alex at
"the kind of credibility that the Sun Spider and V8 benchmarks now enjoy"

Credibility that is not at all deserved, considering that these tests were created specifically to perform well in specific browsers. They test tiny parts of JS and are not relevant to overall performance at all, especially since JS makes up only a tiny part of even the JS-heaviest sites today.

by glop at
...and the page is back up. It does seem that they repeated the tests 10x each and reported the mean. Not great, but at least the tests were run multiple times.
by alex at
Actually, the JS engines in Chrome and Safari have very little impact on even Gmail. There are other bottlenecks on the site that would have made a much more noticeable difference if dealt with, but for some reason they chose to focus narrowly on JavaScript, which isn't even close to being the main culprit even on Gmail.
by glop at