In which you talk me into finally getting a Twitter account by explaining to me why I don't understand Twitter.
I'm a Twitter luddite for perhaps the most pedantic of excuses: for years I've scratched my head at why what seemed like a solved problem has eluded Twitter in its search for scale with stability. A new presentation by Twitter engineer Raffi Krikorian deepens my confusion. First the numbers:
|Avg. Inbound Tweets / Second
|Max. Inbound Tweets / Second
|Tweet Size (bytes)
|Registered Users (M)
|Max Fanout (M)
Social networks like Twitter are just that -- networks -- and to understand Twitter as a network we want to know how much traffic the Twitter "backbone" is routing. Knowing that Twitter does 800 messages inbound per second doesn't tell us directly, but an estimate is possible. From a talk last year by another Twitter engineer, we know that Twitter users have fewer than 200 followers on average. That means that despite the eye-popping 6.1M follower count (in networking terms, "fanout") for Lady GaGa, we should expect most tweets to generate significantly less load. Dealing just in averages, we should expect baseline load to be roughly 100K delivery attempts per second. Peak traffic is likely less than 1.5M delivery attempts per second (4K senders w/ double the average connectedness plus some padding for high-traffic outliers).
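A minimal sketch of that back-of-envelope estimate, for anyone who wants to poke at the inputs. The inbound rates come from the figures quoted in this post; the average fanout of 125 is an assumption picked to sit under the "fewer than 200 followers" bound, not a number from either talk.

```python
# Back-of-envelope delivery load: inbound tweets/s multiplied by follower fanout.

def delivery_attempts_per_second(inbound_per_second: int, avg_fanout: int) -> int:
    """Each inbound tweet is delivered once per follower."""
    return inbound_per_second * avg_fanout

AVG_INBOUND = 800          # average inbound tweets/s (quoted above)
MAX_INBOUND = 4_000        # peak inbound tweets/s (quoted above)
ASSUMED_AVG_FANOUT = 125   # ASSUMPTION: some value under the 200-follower average bound

baseline = delivery_attempts_per_second(AVG_INBOUND, ASSUMED_AVG_FANOUT)
# Peak: peak senders with double the average connectedness, before any padding.
peak = delivery_attempts_per_second(MAX_INBOUND, 2 * ASSUMED_AVG_FANOUT)

print(f"baseline: {baseline:,} deliveries/s")
print(f"peak:     {peak:,} deliveries/s (before padding for outliers)")
```

Swapping in the full 200-follower bound raises the baseline to 160K, which is still the same order of magnitude.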
Knowing that peak loads are 4x average loads is useful and we can provision based on that. We also know that Twitter doesn't guarantee message order and has no SLA for delivery, which means we can deal with the Lady GaGa case by smearing delivery for users with huge fanout, ordering by something smart (most active users get messages first?). Heck, Twitter doesn't even guarantee delivery, so we could even go best-effort if the system is congested, taking total load into account for the smear size of large senders or recovering out of band later by having listeners query a DB. So far our requirements are looking pretty sweet. Twitter's constraints significantly ease the engineering challenge for the core routing and delivery function (the thing that should never be down).
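One way the "smearing" idea could look in code is sketched below. Everything here is hypothetical: the function names, the activity score, and the notion of a per-tick delivery budget are illustrative assumptions, not a description of how Twitter actually schedules delivery.

```python
from typing import Iterator, List, Tuple

def budget_for_load(capacity: int, current_load: int, floor: int = 1) -> int:
    """Deliveries a big sender may use this tick, shrinking as total load rises.

    The floor keeps large fanouts draining slowly even when the system is congested.
    """
    return max(floor, capacity - current_load)

def smear_fanout(
    followers: List[Tuple[str, float]], per_tick_budget: int
) -> Iterator[List[str]]:
    """Yield one batch of follower ids per tick, most active followers first.

    `followers` holds (follower_id, activity_score) pairs; the score is a
    stand-in for whatever "something smart" ends up meaning.
    """
    if per_tick_budget <= 0:
        raise ValueError("per_tick_budget must be positive")
    ordered = sorted(followers, key=lambda f: f[1], reverse=True)
    for start in range(0, len(ordered), per_tick_budget):
        yield [fid for fid, _ in ordered[start:start + per_tick_budget]]

followers = [("alice", 0.1), ("bob", 0.9), ("carol", 0.5)]
budget = budget_for_load(capacity=10, current_load=8)
batches = list(smear_fanout(followers, budget))
print(batches)
```

Because order and delivery aren't guaranteed, a congested system could simply stop consuming batches early and let the remaining listeners catch up out of band.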
What about tweet size? How much will an individual tweet tax a network? Can we handle tweets as packets? Tweet text is clamped to 200 bytes (as per Raffi's slides) but tweets now support extra metadata. The Twitter API Wiki notes that this metadata is also limited, clamped to 512 bytes. Assuming we need a GUID-sized counter for a unique tweet ID, that puts our payload at 200+512+16 = 728 bytes. That's less than half the size of the default Ethernet MTU -- 1500 bytes. IP allows packets up to 64K in size, and with jumbo Ethernet frames we could avoid fragmentation at the link level and still accommodate 9K packets, but there's no need to worry about that now.
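The payload arithmetic is easy to check, including header overhead. The sizes for tweet text, metadata, and ID are the ones used above; carrying the tweet in a UDP-style datagram over IPv4 with no options is an assumption made only to get concrete header sizes.

```python
# Does one tweet fit in a single unfragmented Ethernet frame?

TWEET_TEXT_BYTES = 200   # clamp on tweet text
METADATA_BYTES = 512     # clamp on extra metadata
TWEET_ID_BYTES = 16      # GUID-sized unique ID
ETHERNET_MTU = 1500      # default Ethernet MTU

IPV4_HEADER_BYTES = 20   # minimum IPv4 header (no options)
UDP_HEADER_BYTES = 8     # ASSUMPTION: UDP-style datagram framing

payload = TWEET_TEXT_BYTES + METADATA_BYTES + TWEET_ID_BYTES
on_wire = payload + IPV4_HEADER_BYTES + UDP_HEADER_BYTES

print(f"payload: {payload} bytes, with headers: {on_wire} bytes")
print(f"fits in one MTU: {on_wire <= ETHERNET_MTU}")
print(f"payload under half the MTU: {payload * 2 < ETHERNET_MTU}")
```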
Twitter's subscriber base also fits neatly in the IPv4 address range of ~4 billion unique addresses. Even if we were to give every subscriber an address for every one of their subscribed delivery endpoints (SMS, web, etc.), we'd still fit nicely in IPv4 space. Raffi's slides show that they want to serve all of earth which means eventually switching to IPv6, but that's so far away from the trend line that we can ignore it for now. That means we can handle addressing (source and destination) and data in the size of a single IP packet and still have room to grow.
So now we're down to the question that's been in the back of my mind for years: can we buy Twitter's core routing and delivery function off the shelf? And if so, how much would it cost, assuming continued network growth? Assuming 4x average peak and a 2K/s inbound message baseline (enough to get them through 2011?) and an average fanout of 300 (we're being super generous here, after all), we're looking at 2.5 million packets to route per second. If we treat each delivery endpoint as an IP address and again multiply deliveries by endpoints and assume 4 delivery endpoints per user, we're looking at a need to provision for 10M deliveries per second.
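Spelled out as a small sketch, the provisioning numbers fall out like this. All four inputs are the assumptions stated in the paragraph above; the exact products land slightly under the rounded 2.5M and 10M figures.

```python
# Provisioning estimate from the stated assumptions.

BASELINE_INBOUND = 2_000   # inbound tweets/s baseline
PEAK_MULTIPLIER = 4        # peak load relative to average
AVG_FANOUT = 300           # generous average follower count
ENDPOINTS_PER_USER = 4     # delivery endpoints (SMS, web, etc.) per follower

packets_per_second = BASELINE_INBOUND * PEAK_MULTIPLIER * AVG_FANOUT
deliveries_per_second = packets_per_second * ENDPOINTS_PER_USER

print(f"packets to route at peak: {packets_per_second:,}/s")
print(f"endpoint deliveries at peak: {deliveries_per_second:,}/s")
```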
Is that a lot? Maybe, but I have reason to think not.
10M 1.5KB packets is ~15GB/second of traffic. Core routers now do terabits of traffic per second (1Tb/s is 125GB/s), but most of that traffic doesn't correspond to unique routes. Instead, we need to figure out if hardware can do either the 2.5M or 10M new "connections" per second that the Twitter workload implies. Cisco's mid-range 7600 series appears to be able to handle 15M packets per second of raw forwarding. Remember, this is an "internal" network, no advanced L3 or L4 services -- just moving packets from one subnet to another as fast as possible, so quoting numbers with all the "real world" stuff turned off is OK.
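Here is a rough sketch of the bandwidth and packet-rate comparison. It pessimistically sizes every packet at a full 1500-byte MTU and takes the 15M packets-per-second forwarding figure quoted above at face value; neither is a measured number.

```python
# Worst-case bandwidth for the peak workload, compared with a 1 Tb/s core router.

DELIVERIES_PER_SECOND = 10_000_000    # provisioning target from above
PACKET_BYTES = 1_500                  # pessimistic: every packet is a full MTU
ROUTER_FORWARDING_PPS = 15_000_000    # raw forwarding figure quoted above
TERABIT_BYTES_PER_SECOND = 10**12 // 8

bytes_per_second = DELIVERIES_PER_SECOND * PACKET_BYTES
bits_per_second = bytes_per_second * 8

print(f"traffic: {bytes_per_second / 1e9:.0f} GB/s ({bits_per_second / 1e9:.0f} Gb/s)")
print(f"1 Tb/s is {TERABIT_BYTES_PER_SECOND / 1e9:.0f} GB/s")
print(f"fits the packet-rate budget: {DELIVERIES_PER_SECOND <= ROUTER_FORWARDING_PPS}")
```

At the actual 728-byte payload the bandwidth is roughly half of this, so packet rate rather than raw throughput is the number to watch.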
I'm still not sure that I fully grok the limits of the gear I see for sale since I'm not a network engineer and most "connection per second" numbers I see appear to be related to VPN and Firewall/DPI. It looks like the likely required architecture would have multiple tiers of routing/switching to do things efficiently and not blow out routing tables, but overall it still seems doable to me. This workload is admittedly weird in its composition relative to stateful TCP traffic and I have no insight into what that might do in off-the-shelf hardware -- it might just be the sticky wicket. Knowing that there's some ambiguity here, I hope someone with more router experience can comment on reducing the Twitter workload to off-the-shelf hardware.
Perhaps the large number of unique and short-lived routes would require extra tiers that might reduce the viability of a hardware solution (if only economically)? ISTM that even if hardware can only keep 2-4M routes in memory at once and can only do a fraction of that in new connections per second, this could still be made to work with semi-intelligent "edge" coalescing and/or MPLS tagging...although based on the time it takes to get a word of memory from main memory (including the cache miss) on modern hardware, it seems feasible that tuned hardware should be able to do at least 1M route lookups per second which puts the current baseline well within hardware and the 2011 growth goals within reach.
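To make the memory-latency argument concrete, here is a very rough sketch. Both inputs are assumptions chosen for illustration, not measurements: about 100ns for a main-memory access that misses cache, and up to 10 dependent memory accesses per route lookup (say, walking a lookup structure one level at a time).

```python
# Lower bound on route lookups per second for one serial lookup pipeline.

NANOSECONDS_PER_SECOND = 1_000_000_000
ASSUMED_MEMORY_LATENCY_NS = 100   # ASSUMPTION: main-memory access incl. cache miss
ASSUMED_ACCESSES_PER_LOOKUP = 10  # ASSUMPTION: dependent memory reads per lookup

def lookups_per_second(latency_ns: int, accesses_per_lookup: int) -> int:
    """Serial lookups/s if every access goes all the way to main memory."""
    return NANOSECONDS_PER_SECOND // (latency_ns * accesses_per_lookup)

worst_case = lookups_per_second(ASSUMED_MEMORY_LATENCY_NS, ASSUMED_ACCESSES_PER_LOOKUP)
best_case = lookups_per_second(ASSUMED_MEMORY_LATENCY_NS, 1)

print(f"~{worst_case:,} lookups/s with {ASSUMED_ACCESSES_PER_LOOKUP} accesses each")
print(f"~{best_case:,} lookups/s with a single access each")
```

Under these assumptions a single serial pipeline lands around 1M lookups per second, and anything that caches hot routes or runs lookups in parallel only improves on that.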
So I'm left back where I started, wondering what's so hard? Yes, Twitter does a lot besides delivering messages, but all of those things (that I understand and/or know about) have the wonderful behavior that they're either dealing with the (relatively low) inbound rate of 4K messages/s (max) or that they're embarrassingly parallel.
So I ask you, lazyweb, what have I missed?
I've been meaning to move this blog off a sub-domain affixed to the Dojo Project for some time, but I finally got all the pieces lined up last night. This blog will continue to be technical in nature and I'll set up another location here on infrequently.org for political stuff. Thanks for being patient with any latent brokenness.
The concepts of negative externality and moral hazard describe situations where one person can impose costs on another without paying for it, often resulting in less-than-optimal outcomes for everyone. That sounds a lot like what's going on with organizations that won't upgrade from IE6 to me. Let's quickly consider both sides of the browser equation and then sketch out some possible solutions, keeping in mind that the assumed goal is a better, less frustrating web experience for users and developers. We'll also look to see how this stacks up with the fairness goal of buyers paying full-freight for the costs of production.
Firms have incentive to maximize return on investments, meaning not switching immediately when better browsers are available, even if the nominal price is zero since the real price may be much higher. Retraining, support, validation, and rework of existing systems that won't work with a new browser all add up to create a large disincentive to any change. A new browser -- or even a new version of an existing browser -- has to be worth enough to outweigh those potential costs. It may cost real money just to figure out if upgrading will cost a lot. Let's assume that organizations are deciding under uncertainty.
Web developers want their customers to pay at least what it costs to produce an app. This may be hard to estimate. They'd also like to deliver competitive apps at as low a cost as possible and often want to maximize the size of their addressable market, which means supporting the broadest possible swath of browsers. They could build features once for old browsers and again (perhaps better) for new browsers, but that's expensive. Only the largest sites and firms can contemplate such a strategy, and usually as a way of mopping up marginal market share once they've "won" the primary market battle. Developers of new apps have strong incentives to build to the least common denominator and address the largest potential market.
So what browsers to include? There's historical data on browser share but things move slowly enough that the future is going to look a lot like the present, particularly related to development cycles. Public statistics on browser share may not even resemble the market for a vertically focused product. Enterprise software developers can count on more legacy browser users than consumer sites. In any case, it's unlikely that a firm knows all of its future customers. It pays to be conservative about what browsers to support.
What if a developer builds an app, bears the pain of supporting old browsers, but does not sell many units to users of old browsers? There's potential deadweight loss in this case, but it might be OK; the developer reduced their uncertainty and that's worth something.
What's good for a single firm may be bad for the ecosystem, though. The cumulative effects of this dynamic compound. Application buyers are also the market for browsers, but on different time scales. The costs of a browser upgrade may not be known and may dwarf the cost of any individual app, making it unlikely that cost savings for an app targeted at newer browsers will win the day. More likely the customer will lean on their supplier to support their old browser. Mismatches in size and clout between vendors and clients amplify this dynamic. What small consulting firm can tell a Fortune 500 firm that may be their largest customer to go stuff it if they don't upgrade from IE 6? Small vendors may be able to target more than just the supported browsers at their largest client, but again at the risk of taking deadweight losses. Large, slow-moving organizations may hurt individual apps but cumulatively they can also rob the market of growth thanks to the third linkage: the connection between browser makers and application developers.
It might come as a surprise, but browser vendors care very much what web developers do. We see this in the standards process where a lack of use is cause for removing features from specs. After all, standards are insurance policies that developers and customers take out against their investment in technologies -- in this case browsers and the features they support. It doesn't make sense to insure features that nobody is using. Developers whose clients are slow-moving may shy away from using new features, robbing the process of the feedback that's critical in cementing progress. With the feedback loop weakened, browser makers may assume that developers don't want new features or don't want the ones they've built. Worse, they may wrongly think that developers just want better/faster versions of existing features, not new features that open up new markets.
I've glossed over lots of details at every step here, but by now we can see how the dynamic caused by legacy content in organizations that demand continuity robs us all of forward momentum. More frustratingly, we can also see how everyone in the process is behaving rationally(ish) and without obvious malice. That doesn't mean the outcomes are good. If firms could make new web features available for their suppliers to target faster, they would strengthen the feedback loop between developers and browser makers and also reduce their own procurement costs for applications, assuming they could continue to use their old applications. The key to enabling this transition to a better equilibrium lies in reducing those potential costs of change. In many ways, that comes down to reducing the uncertainty. If new features could work alongside legacy content without retraining, added support costs, and without the need for exploratory work to understand the potential impacts, organizations should be more willing to accept modern applications. We need to make free cheaper.
There are other ways of addressing market imbalances like this, of course. One traditional answer is for governments to tax those who externalize their costs onto others, bringing the actual price of goods back into line. Regulation to prevent externalization in the first place can also be effective (e.g., the Clean Water Act). The use of the courts to find and provide remedies sometimes works but looks implausible here -- you'd need a court to accept a theory of "browser pollution" in order to show harm. Derivative contracts may allow first parties (developers) to spread their potential costs, assuming they can find buyers who can judge the risks, but this looks to be a long way out for web development. Building basic schedules for relatively differentiated goods is hard enough. Asking others to trade on one small-ish aspect of a development process feels far-fetched.
Reasonable people disagree about how we should attack the problem. My own thinking on the topic has certainly evolved.
Next time you hear someone say "if only I could use X", remember that the way we'll get to a better future is by bringing everyone else along for the ride. We won't get there by telling them what to do or by implying with moral overtones that their locally optimal decision is "wrong". Instead we can bring them along by understanding their interests and working to reduce the very real friction that robs us all of a better future. You can do your part by opting your pages into the future and working with your users to help them understand how cheap free has truly become.
Side-by-side versions of different browsers are critical for us webdevs. On some browsers, it's because multi-year-old versions are still prevalent (luckily, you can start ignoring them). In the case of Chrome, nearly all users are up-to-date with the very latest Stable release. Being the forward-thinking chap/gal you are, you want to try out new stuff like WebGL that's only available in the Dev channel. But you don't want to mistakenly build something that uses features users don't have yet. What to do? Side-by-side installs, a.k.a. the Canary channel! Now you can test with Stable while plotting world domination on Dev. Awesome.
This new version will auto-update with the new hotness at roughly the same rate as today's Dev channel but will allow you to install and run it alongside a Stable channel version of Chrome. This new channel is Windows-only for now, but you have VMs for testing anyway, right? Happy testing!
You can get yours here. Note that these are Dev Channel builds, so they contain the new hotness. Also, the new bugs. Caveat emptor.
Huge props to Robert Shield who has been working through the endless details of this effort for months.