Performance Innumeracy & False Positives

February 3, 2011

tl;dr version: the web is waaaay too slow, and every time you write something off as "just taking a couple of milliseconds", you're part of the problem. Good engineering is about tradeoffs, and all engineering requires environmental assumptions -- even feature testing. In any case, there are good, reliable ways to use UA detection to speed up feature tests in the common case, which I'll show, and to which the generic arguments about UA vs. feature testing simply don't apply. We can and should go faster. Update: Nicholas Zackas explains it all, clearly, in measured form. Huzzah!

Performance Innumeracy

I want to dive into concrete strategies for low-to-no false positive UA matching for use in caching feature detection results, but first I feel I need to walk back to some basics since I've clearly lost some people along the way. Here are some numbers every developer (of any type) should know, borrowed from Peter Norvig's indispensable "Teach Yourself To Program In Ten Years":

Approximate timing for various operations on a typical PC:

execute typical instruction 1/1,000,000,000 sec = 1 nanosec

fetch from L1 cache memory 0.5 nanosec

branch misprediction 5 nanosec

fetch from L2 cache memory 7 nanosec

Mutex lock/unlock 25 nanosec

fetch from main memory 100 nanosec

send 2K bytes over 1Gbps network 20,000 nanosec

read 1MB sequentially from memory 250,000 nanosec

fetch from new disk location (seek) 8,000,000 nanosec

read 1MB sequentially from disk 20,000,000 nanosec

send packet US to Europe and back 150 milliseconds = 150,000,000 nanosec

That data's a bit old -- 8ms is optimistic for a HD seek these days, and SSD changes things -- but the orders of magnitude are relevant. For mobile, we also need to know:

fetch from flash storage	1,300,000 nanosec
60hz time slice	16,000,000 nanosec
send packet outside of a (US) mobile carrier network and back	80-800 milliseconds = 80,000,000 - 800,000,000 nanosec

The 60hz number is particularly important. To build UI that feels not just fast, but instantly responsive, we need to be yielding control back to our primary event loop in less than 16ms, all the time, every time. Otherwise the UI will drop frames and the act of clicking, tapping, and otherwise interacting with the app will seem "laggy" or "janky". Framing this another way, anything your webapp blocks on for more than 16ms is the enemy of solid, responsive UI.

Why am I blithering on and on about this? Because some folks continue to mis-prioritize the impact of latency and performance on user satisfaction. Google (my employer, who does not endorse this blog or my private statements in any way) has shown that seemingly minor increases in latency directly impact user engagement and that major increases in latency (> 500ms) can reduce traffic and revenue significantly. Latency then, along with responsiveness (do you drop below 60hz?), is a key metric for measuring the quality of an web experience. It's no accident that Google employs Steve Souders to help evangelize the cause of improving performance on the web, and has gone so far as to build products like Chrome & V8 who have as a core goal to the web faster. A faster web is a better web. Full stop.

That's why I get so deeply frustrated when we get straw-man based, data-challenged advocacy from the maintainers of important bits of infrastructure:

This stuff is far from easy to understand; even just the basics of feature detection versus browser detection are quite confusing to some people. That’s why we make libraries for this stuff (and, use browser inference instead of UA sniffing). These are the kind of efforts that we need, to help move the web forward as a platform; what we don’t need is more encouragement for UA sniffing as a general technique, only to save a couple of milliseconds. Because I can assure you that the Web never quite suffered, technologically, from taking a fraction of a second longer to load.

What bollocks. Not only did I not encourage UA sniffing "as a general technique", latency does in fact hurt sites and users -- all the time, every day. And we're potentially not talking about "a couple of milliseconds" here. Remember, in the context of mobile devices, the CPUs we're on are single-core and clocked in the 500mhz-1ghz range, which directly impacts the performance of single-threaded tasks like layout and JavaScript execution -- which by the way happen in the same thread. In my last post I said:

...if you’re a library author or maintainer, please please please consider the costs of feature tests, particularly the sort that mangle DOM and or read-back computed layout values

Why? Because many of these tests inadvertently force layout and style re-calculation. See for instance this snippet from has.js:

if(has.isHostType(input, "click")){
  input.type = "checkbox";
  input.style.display = "none";
  input.onclick = function(e){
    // ...
  };
  try{
    de.insertBefore(input, de.firstChild);
    input.click();
    de.removeChild(input);
  }catch(e){}
  // ...
}

Everything looks good. The element is display: none; so it shouldn't be generating render boxes when inserted into the DOM. Should be cheap, right? Well, lets see what happens in WebKit. Debugging into a simple test page with equivalent code shows that part of the call stack looks like:

#0	0x0266267f in WebCore::Document::recalcStyle at Document.cpp:1575
#1	0x02662643 in WebCore::Document::updateStyleIfNeeded at Document.cpp:1652
#2	0x026a89fd in WebCore::MouseRelatedEvent::receivedTarget at MouseRelatedEvent.cpp:152
#3	0x0269df03 in WebCore::Event::setTarget at Event.cpp:282
#4	0x026af889 in WebCore::Node::dispatchEvent at Node.cpp:2604
#5	0x026adbcb in WebCore::Node::dispatchMouseEvent at Node.cpp:2885
#6	0x026ae231 in WebCore::Node::dispatchSimulatedMouseEvent at Node.cpp:2816
#7	0x026ae3f1 in WebCore::Node::dispatchSimulatedClick at Node.cpp:2837
#8	0x02055bb5 in WebCore::HTMLElement::click at HTMLElement.cpp:767
#9	0x022587e6 in WebCore::HTMLInputElementInternal::clickCallback at V8HTMLInputElement.cpp:707
...

Document::recalcStyle() can be very expensive, and unlike painting, it blocks input and other execution. And the cost is at page loading is likely to be much higher than other times as there will be significantly more new styles streamed in from the network to satisfy for each element in the document when this is called. This isn't a full layout, but it's most of the price of one. Now, you can argue that this is a WebKit bug and I'll agree -- synthetic clicks should probably skip this -- but I'm just using this as an illustration to show that what browsers are doing on your behalf isn't always obvious. Once this bug is fixed, this test may indeed be nearly free, but it's not today. Not by a long shot.

Many layouts in very deep and "dirty" DOMs can take ten milliseconds or more, and if you're doing it from script, you're causing the system to do lots of work which it's probably going to need to throw away later when the rest of your markup and styles show up. Your average, dinky test harness page likely under-counts the cost of these tests, so when someone tells me "oh, it's only 30ms", not only do my eyes bug out at the double-your-execution-budget-for-anything number, but also the knowledge that in the real world, it's probably a LOT worse. Just imagine this happening in a deep DOM on a low-end ARM-powered device where memory pressure and a single core are conspiring against you.

False Positives

My last post concerned how you can build a cache to eliminate many of these problems if and only if you build UA tests that don't have false positives. Some commenters can't seem to grasp the subtlety that I'm not advocating for the same sort of lazy substring matching that has deservedly gotten such a bad rap.

So how would we build less naive UA tests that can have feature tests behind them as fallbacks? Lets look at some representative UA strings and see if we can't construct some tests for them that give us sub-version flexibility but won't pass on things that aren't actually the browsers in question:

IE 6.0, Windows:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)

FF 3.6, Windows:

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.13) Firefox/3.6.13

Chrome 8.0, Linux:

Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Ubuntu/10.10 Chromium/8.0.552.237 Chrome/8.0.552.237 Safari/534.10

Safari 5.0, Windows:

Mozilla/5.0 (Windows; U; Windows NT 6.1; sv-SE) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4

Some features start to jump out at us. The "platform" clauses -- that bit in the parens after the first chunk -- contains a lot of important data and a lot of junk. But the important stuff always comes first. We'll need to allow but ignore the junk. Next, stuff after platform clauses is good, has defined order, and can be used to tightly form a match for browsers like Safari and Chrome. With this in mind, we can create some regexes that don't allow much in the way of variance but do allow sub-minor version to match so we don't have to update these every month or two:

IE60 = /^Mozilla\/4\.0 \(compatible; MSIE 6\.0; Windows NT \d\.\d(.*)\)$/;
FF36 = /^Mozilla\/5\.0 \(Windows; U;(.*)rv\:1\.9\.2.(\d{1,2})\)( Gecko\/(\d{8}))? Firefox\/3\.6(\.\d{1,2})?( \(.+\))?$/;
CR80 = /^Mozilla\/5\.0 \((Windows|Macintosh|X11); U;.+\) AppleWebKit\/534\.10 \(KHTML\, like Gecko\) (.+)Chrome\/8\.0\.(\d{3})\.(\d{1,3}) Safari\/534\.10$/;

These look pretty wordy, and they are, because they're designed NOT to let through things that we don't really understand. This isn't just substring matching on the word "WebKit" or "Chrome", this is a tight fit against the structure of the entire string. If it doesn't fit, we don't match, and our cache doesn't get pre-populated. Instead, we do feature detection. Remember, false positives here are the enemy, so we're using "^" and "$" matches to ensure that the string has the right structure all the way through, not just at some random point in the middle, which UA's that parade around as other browsers tend to do.

Here's some sample code that incorporates the approach:

(function(global){
// The map of available tests
var featureTests = {
"audio": function() {
var audio = document.createElement("audio");
return audio && audio.canPlayType;
},
"audio-ogg": function() { /.../ }
// ...
};
// A read-through cache for test results.
var testCache = {};
// An (exported) function to run/cache tests
global.ft = function(name) {
return testCache[name] = (typeof testCache[name] == "undefined") ?
featureTestsname :
testCache[name];
};
// Tests for 90+% of current browser usage
var ua = (global.navigator) ? global.navigator.userAgent : "";
// IE 6.0/WinXP:
var IE60 = /^Mozilla/4.0 (compatible; MSIE 6.0; Windows NT \d.\d(.))$/;
if (ua.search(IE60) == 0) {
testCache = { "audio": 1, "audio-ogg": 0 / ... */ };
}
// IE 7.0
// ...
// IE 8.0
// ...
// IE 9.0 (updated with fix from John-David Dalton)
var IE90 = /^Mozilla/5.0 (compatible; MSIE 9.0; Windows NT \d.\d(.))$/;
if (ua.search(IE90) == 0) {
testCache = { "audio": 1, "audio-ogg": 0 / ... */ };
}
// Firefox 3.6/Windows
var FF36 = /^Mozilla/5.0 (Windows; U;(.)rv:1.9.2.(\d{1,2}))( Gecko/(\d{8}))? Firefox/3.6(.\d{1,2})?( (.+))?$/;
if (ua.search(FF36) == 0) {
testCache = { "audio": 1, "audio-ogg": 1 / ... */ };
}
// Chrome 8.0
var CR80 = /^Mozilla/5.0 ((Windows|Macintosh|X11); U;.+) AppleWebKit/534.10 (KHTML, like Gecko) (.+)Chrome/8.0.(\d{3}).(\d{1,3}) Safari/534.10$/;
if (ua.search(FF36) == 0) {
testCache = { "audio": 1, "audio-ogg": 1 /* ... */ };
}
// Safari 5.0 (mobile)
var S5MO = /^Mozilla/5.0 (iPhone; U; CPU iPhone OS \w+ like Mac OS X; .+) AppleWebKit/(\d{3,}).(\d+).(\d+) (KHTML, like Gecko) Version/5.0(.\d{1,})? Mobile/(\w+) Safari/(\d{3,}).(\d+).(\d+)$/;
if (ua.search(FF36) == 0) {
testCache = { "audio": 1, "audio-ogg": 0 /* ... */ };
}
// ...
})(this);

New versions of browsers won't match these tests, so we won't break libraries in the face of new UAs -- assuming the feature tests also don't break, which is a big if in many cases -- and we can go faster for the majority of users. Win.

execute typical instruction	1/1,000,000,000 sec = 1 nanosec
fetch from L1 cache memory	0.5 nanosec
branch misprediction	5 nanosec
fetch from L2 cache memory	7 nanosec
Mutex lock/unlock	25 nanosec
fetch from main memory	100 nanosec
send 2K bytes over 1Gbps network	20,000 nanosec
read 1MB sequentially from memory	250,000 nanosec
fetch from new disk location (seek)	8,000,000 nanosec
read 1MB sequentially from disk	20,000,000 nanosec
send packet US to Europe and back	150 milliseconds = 150,000,000 nanosec