Census 2: More Than Just A Pretty Graph

Benchmarks are hard, particularly for complex systems. As a result, the most hotly contested benchmarks tend not to be representative of what makes systems faster for real users. Does another 10% on TPC really matter to most web developers? And should we really pay any attention to how any JS VM does on synthetic language benchmarks?

Maybe.

These things matter only to the extent that they represent end-user workloads and that their findings can be trusted. The first is much harder than the second, and end-to-end benchmarking is pretty much the only way to get there. As a result, sites like Tom’s Hardware focus on application-level benchmarks while still publishing “low level” numbers. Venerable test suites like SPECint have even moved toward running “full stack” style benchmarks which may emphasize a particular workload but are broad enough to capture the wider system effects that matter in the real world.

Marketing departments also like small, easily digestible whole numbers. Saying something like “200% Faster!” sure sounds a lot better than “on a particular test which is part of a larger suite of tests, our system ran in X time vs. Y time for the competitor”. Both may be true, but the second statement gives you some context. Better still, that statement would sit above an actual table of numbers or a graph. Numbers without context are lies waiting to be repeated.

With all of this said, James Ward’s Census benchmark makes a valiant stab at a full-stack test of data loading and rendering performance for RIA technologies. Last month Jared dug further into the numbers and found the methodology wanting, but given some IP issues he couldn’t patch the sources himself. Since I wasn’t encumbered in the same way, I thought I might as well try my hand at it, but after hours of attempting to get the sources to build, I finally gave up and decided to re-write the tests. The result is Census 2.

There are several goals for this re-write:

  • Fairness. Tests need to be run multiple times to be representative at all. Likewise, systems not under test need to be factored out as much as possible. C2 does this by reducing the number of dependencies and by running each test at least five times, discarding outliers before reporting an average (see the sketch after this list). I’ve also worked to make sure that the tests put the best foot forward for each of the tested technologies.
  • Hackability. Benchmarks like Census serve first as a way for decision makers to understand their options and second as a way for developers to know how they’re doing. Making it trivial to add tests helps both audiences.
  • Portability. The test suite should run nearly everywhere with a minimum of setup and fuss. This ensures that the largest number of people can benefit from the fairness and hackability of the tests.
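
To make that timing discipline concrete, here’s its basic shape in a simplified sketch (runTest is a stand-in for the operation under test, not the actual harness API):

    // Run each test at least five times, drop the fastest and slowest
    // runs as outliers, and report the average of the rest.
    function timeTest(runTest, iterations) {
      iterations = iterations || 5;
      var times = [];
      for (var i = 0; i < iterations; i++) {
        var start = new Date().getTime();
        runTest();
        times.push(new Date().getTime() - start);
      }
      times.sort(function(a, b) { return a - b; });
      times = times.slice(1, times.length - 1); // discard high/low outliers
      var sum = 0;
      for (var j = 0; j < times.length; j++) {
        sum += times[j];
      }
      return sum / times.length;
    }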

The results so far have been instructive. On smaller data sets HTML wins hands-down for time-to-render, despite its disadvantage in over-the-wire size. For massive data sets, pagination saves even the most feature-packed of RIA Grids, allowing the Dojo Grid to best even XSLT and a more compact JSON syntax. Of similar interest is the delta between page cycle times on modern browsers vs. their predecessors. Flex can have a relatively even performance curve across host browsers, but the difference between today’s browsers is simply stunning.

Given the lack of an out-of-the-box paginating data store for Flex, RIAs built on that stack seem beholden either to Adobe’s LCDS licensing or to ad-hoc pagination built into apps by hand in order to get reasonable performance for data-rich business applications. James Ward has already exchanged some mail with me on this topic, and it’s my hope that we can show how to do pagination in Flex without needing LCDS in the near future.
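
The request pattern behind such hand-rolled pagination is simple enough, whatever the client stack. Here’s a JavaScript sketch; the /data endpoint and its start/count parameters are invented for illustration:

    // Ask the server for just the rows the user can currently see.
    function fetchPage(start, count, onData) {
      var xhr = new XMLHttpRequest();
      xhr.open("GET", "/data?start=" + start + "&count=" + count, true);
      xhr.onreadystatechange = function() {
        if (xhr.readyState == 4 && xhr.status == 200) {
          // use JSON.parse or a library equivalent where available
          onData(eval("(" + xhr.responseText + ")"));
        }
      };
      xhr.send(null);
    }

    // e.g., fetch rows 0-99 now and rows 100-199 when the user pages
    // forward, handing each batch to whatever redraws the visible table:
    fetchPage(0, 100, renderRows);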

The tests aren’t complete. There’s still work to do to get some of the SOAP and AMF tests working again. If you have ideas about how to get this done w/o introducing a gigantic hairball of a Java toolchain, I’m all ears. Also on the TODO list is an AppEngine app for recording and analyzing test runs so that we can say something interesting about performance on various browsers.

Census 2 is very much an open source project and so if you’d like to get your library or technology tested, please don’t hesitate to send me mail or, better yet, attach patches to the Bug Tracker.

Update: I failed to mention earlier that one of the largest changes in C2 vs. Census is that we report full page cycle times. Instead of reporting just the “internal” timings of an RIA which has been fully bootstrapped, the full page times capture everything from page load to the moment the output is responsive to user action. This keeps JavaScript frameworks (or even Flex) from omitting from the reports the price that users pay to download their (often sizable) infrastructure. There’s more work to do in reporting overall sizes and times (the “bandwidth” numbers don’t report gzipped sizes, e.g.), but if you want the skinny on real performance, scroll down to the red bars. That’s where the action is.
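
For those wondering about the mechanics, full page cycle timing boils down to something like the following simplified sketch (the recordResult callback and the “start” query parameter are illustrative, not the actual harness interface):

    // Harness side: stamp a start time into the test page's URL.
    function runCycle(testUrl) {
      document.getElementById("testFrame").src =
          testUrl + "?start=" + new Date().getTime();
    }

    // Test page side: called only once rendering completes and the UI
    // is responsive, not merely when onload fires.
    function reportDone() {
      var match = /[?&]start=(\d+)/.exec(window.location.search);
      var fullCycle = new Date().getTime() - parseInt(match[1], 10);
      parent.recordResult(window.location.pathname, fullCycle);
    }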

11 Comments

  1. Posted December 16, 2008 at 10:33 am | Permalink

    FWIW, for widgets (like the grid) that make extensive use of getValue calls, the ServiceStore/JsonRestStore tends to be much faster, especially compared to the ItemFileReadStore/ItemFileWriteStore and probably a lot faster than QueryReadStore as well (although QRS might be faster on actual data parsing times, not sure). Using those stores might make Dojo look a little better.

  2. Posted December 16, 2008 at 12:26 pm | Permalink

    Kris:

    So I’d love to add a ServiceStore test. The overhead of Item wrapping in IFRS and even QRS has bugged me for some time, so showing how SS or JRS stack up would be a huge boon to the test suite. I can add you to the project if you like.
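
    For anyone who wants a head start, hooking the grid to a JsonRestStore is roughly this simple. A sketch against Dojo 1.2-era APIs; the service URL and columns are made up:

        dojo.require("dojox.data.JsonRestStore");
        dojo.require("dojox.grid.DataGrid");

        dojo.addOnLoad(function() {
          var store = new dojox.data.JsonRestStore({ target: "/census/people/" });
          var grid = new dojox.grid.DataGrid({
            store: store,
            structure: [
              { field: "name", name: "Name", width: "20em" },
              { field: "zip_code", name: "ZIP", width: "8em" }
            ]
          }, "gridNode");
          grid.startup(); // rows are fetched lazily, by range, as the user scrolls
        });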

    Regards

  3. Posted December 16, 2008 at 1:19 pm | Permalink

    it’s great that such benchmarking is done, but pleaaase provide a summary; most people are not interested in the technology/methodology or what keeps those tests running, just say:
    blabla
    results: Flex is faster than AJAX on IE + Opera, but not on FF

    There is no single finished graph or image in this post, and the only image links to dojofoundation, very helpful.

    What I learned here: there are several technologies, but no idea which is best/good in which scenario, and running those tests on my machine, while there is emul and whatnot running, will not provide any reliable data.

  4. Posted December 16, 2008 at 2:12 pm | Permalink

    @Alex: yeah, go ahead and add me, I’ll add it when I have a chance.

  5. Posted December 16, 2008 at 2:47 pm | Permalink

    grosser:

    The problem, until we get the AppEngine reporting back-end done, is that it’s not that simple. As with most benchmarks, having one big headline number is perhaps the simplest way to get your benchmark to tell lies (even when that’s totally unintentional).

    This benchmark doesn’t output a single big number because doing so wouldn’t be honest. Numbers without context are lies waiting to be repeated.

    I’ve fixed the link from the image. Sorry about that!

    Regards

  6. Posted December 17, 2008 at 1:41 am | Permalink

    This is great news.
    I will have to take a close look at it.

    Note that BlazeDS (and, I assume, LCDS as well) bundles HTTP requests using an unspecified heuristic (it seems to depend on the time between requests).

    This might work very well for getting better benchmark numbers, but in practice it might not be that useful, because one currently cannot influence the heuristic used; it’s done in native code.

    Regards,
    Markus

  7. Patrick Whittingham
    Posted December 17, 2008 at 9:30 am | Permalink

    It is recommended by Adobe to use Remoting for more than 1000 rows. There are tons of Flex paging routines on the net, but mostly the paging is done on the client, making it super fast. I’ve created a sample Flex app to bring back a ‘static’ CSV file (1k, 10k, 100k, 250k rows) with HTTPService, and it can load the 10k CSV file in 60ms and parse it in 15ms.

    Here is my sample:
    [MXML source stripped by the comment form; the surviving fragments show row-count options of 1, 10, 100, and 250 and a reference to assets/1_million.csv]

  8. Posted December 17, 2008 at 9:30 am | Permalink

    Stupendous stuff Alex. Having a tool which reports back its results for later analysis is indeed awesome. Can the test suite be invoked on a headless machine running Selenium RC?

  9. Isaac Gouy
    Posted December 17, 2008 at 10:53 am | Permalink

    > And should we really pay any attention to how any
    > JS VM does on synthetic language benchmarks?

    It’s a little strange to see the benchmarks game linked in this context; it seems like SunSpider would be more to the point.

  10. Posted December 18, 2008 at 10:01 am | Permalink

    I was looking at the descriptions of the XSLT and JSON to HTML tests and was wondering a few things. In the case of XSLT, the time taken to retrieve the XSLT is included in the test; however, the XSLT can also be included as a string in the JavaScript code so there is no overhead from the extra request. In many cases XSLT is converted to a JavaScript string in a build process to avoid the second request.

    Also, in the JSON to HTML description it says “This method is relatively parsimonious with network resources due to the reduced amount of encoding cruft inherent in JSON versus HTML or XML,” yet that all depends on how you write your XML. Saying that { person: { name: "Luke", zip_code: "90210" } } has less cruft than <person name="Luke" zip_code="90210"/> is pretty misleading.

    Otherwise, I am very happy to see the inclusion of XSLT and it was something that I was going to add onto James’ work myself!

  11. Posted December 18, 2008 at 11:38 am | Permalink

    Hi Dave:

    You’re right that I could inline the XSLT into the served page, but I’m not sure that it would be representative. Part of the performance advantage of XSLT is that the XSLT doc can be cached across page views whereas the data might change. This allows formatting, which might otherwise balloon the size of content, to be factored out and served off of CDNs. It’s true that in this case we’re requesting it synchronously, and that’s not representative. It is, however, cached across page views which would not be the case if it were served inline.
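
    Concretely, the pattern in question looks something like this (the stylesheet filename is illustrative; Mozilla/WebKit XSLTProcessor shown, IE’s transformNode path differs):

        // The .xsl document is fetched with a plain, cacheable GET and
        // the processor is reused across renders.
        var processor = null;
        function getProcessor() {
          if (!processor) {
            var req = new XMLHttpRequest();
            req.open("GET", "census.xsl", false); // synchronous, as in the test
            req.send(null);
            processor = new XSLTProcessor();
            processor.importStylesheet(req.responseXML);
          }
          return processor;
        }
        function render(xmlDoc, targetNode) {
          targetNode.appendChild(
              getProcessor().transformToFragment(xmlDoc, document));
        }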

    As for the size of XML vs. JSON, even a smaller XML syntax which is still semantically meaningful would still be larger than the comparable JSON. I’ll leave this as an exercise for you. I do plan on adding a terser XML format (the current one is an artifact of the original Census app), but it’ll still be bigger. By how much is an important question, though.
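
    To put rough numbers on it for a single record (byte counts are for the literal strings shown):

        var json       = '{"name":"Luke","zip_code":"90210"}';       // 34 bytes
        var terseXml   = '<person name="Luke" zip_code="90210"/>';   // 38 bytes
        var verboseXml = '<person><name>Luke</name>' +
                         '<zip_code>90210</zip_code></person>';      // 60 bytes
        console.log(json.length, terseXml.length, verboseXml.length); // 34 38 60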

    Regards

2 Trackbacks

  1. By Ajaxian » Census 2: Benchmarking RIAs Rebooted on December 16, 2008 at 11:23 am

    [...] Russell decided to rewrite and create Census 2 to act as a new benchmark for various RIA techniques. This is based on the original Census [...]

  2. [...] co-creator Alex Russell announced Census 2, a project to more accurately compare performance between Ajax and Flex [...]