Twitter By The Back Of A Napkin

In which you talk me into finally getting a Twitter account by explaining to me why I don’t understand Twitter.

I’m a Twitter luddite for perhaps the most pedantic of excuses: for years I’ve scratched my head at why what seemed like a solved problem has eluded Twitter in its search for scale with stability. A new presentation by Twitter engineer Raffi Krikorian deepens my confusion. First the numbers:

Avg. Inbound Tweets / Second:  800
Max. Inbound Tweets / Second:  3,283
Tweet Size (bytes):            200
Registered Users (M):          150
Max Fanout (M):                6.1

Social networks like Twitter are just that — networks — and to understand Twitter as a network we want to know how much traffic the Twitter “backbone” is routing. Knowing that Twitter takes in 800 messages per second doesn’t tell us that, but an estimate is possible. From a talk last year by another Twitter engineer, we know that Twitter users have fewer than 200 followers on average. That means that despite the eye-popping 6.1M follower (in networking terms, “fanout”) count for Lady GaGa, we should expect most tweets to generate significantly less load. Dealing just in averages, we should expect baseline load to be roughly 100K delivery attempts per second. Peak traffic is likely less than 1.5M delivery attempts per second (4K senders per second with double the average connectedness, plus some padding for high-traffic outliers).
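
The arithmetic above, spelled out. The fanout figure here is a stand-in for “fewer than 200 on average,” not a measurement:

```python
# Napkin math for Twitter's internal delivery load. All inputs are
# estimates from the numbers quoted above, not measurements.
AVG_TWEETS_PER_SEC = 800
PEAK_TWEETS_PER_SEC = 4000      # ~3283, rounded up
AVG_FANOUT = 150                # a stand-in for "fewer than 200 on average"

# Baseline: average senders with average connectedness.
baseline = AVG_TWEETS_PER_SEC * AVG_FANOUT
print(f"baseline: ~{baseline:,} delivery attempts/sec")   # ~120,000

# Peak: assume peak-time senders are twice as connected as average;
# padding for outliers gets you to the ~1.5M/sec ceiling.
peak = PEAK_TWEETS_PER_SEC * (2 * AVG_FANOUT)
print(f"peak: ~{peak:,} delivery attempts/sec before padding")   # 1,200,000
```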

Knowing that peak loads are 4x average loads is useful: we can provision based on that. We also know that Twitter doesn’t guarantee message order and has no SLA for delivery, which means we can deal with the Lady GaGa case by smearing delivery for users with huge fanout, ordering by something smart (most active users get messages first?). Heck, Twitter doesn’t even guarantee delivery, so we could go best-effort when the system is congested, taking total load into account when picking the smear size for large senders, or recovering out of band later by having listeners query a DB. So far our requirements are looking pretty sweet. Twitter’s constraints significantly ease the engineering challenge for the core routing and delivery function (the thing that should never be down).
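
None of this is what Twitter actually runs; it’s just a sketch of the smearing idea, with invented names (`smear_deliveries`, `activity_score`):

```python
import itertools

def smear_deliveries(followers, activity_score, batch_size=10_000):
    """Spread a huge fanout over many scheduler ticks.

    Since Twitter guarantees neither ordering nor delivery, a sender
    with millions of followers can be drained a batch at a time,
    most-active followers first. Under congestion, later batches can
    simply be delayed further or dropped (best-effort).
    """
    ranked = sorted(followers, key=activity_score, reverse=True)
    it = iter(ranked)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch  # deliver one batch per tick

# Lady GaGa's 6.1M followers in 10K-sized batches: 610 ticks to drain.
```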

What about tweet size? How much will an individual tweet tax a network? Can we handle tweets as packets? Tweet text is clamped to 200 bytes (per Raffi’s slides), but tweets now support extra metadata. The Twitter API Wiki notes that this metadata is also limited, clamped to 512 bytes. Assuming we need a GUID-sized counter for a unique tweet ID, that puts our payload at 200 + 512 + 16 = 728 bytes. That’s less than half the default ethernet MTU of 1500 bytes. IP allows packets up to 64K in size, and with jumbo ethernet frames we could avoid fragmentation at the link level and still accommodate 9K packets, but there’s no need to worry about that now.

Twitter’s subscriber base also fits neatly in the IPv4 address range of ~4 billion unique addresses. Even if we were to give every subscriber an address for every one of their subscribed delivery endpoints (SMS, web, etc.), we’d still fit nicely in IPv4 space. Raffi’s slides show that they want to serve all of earth which means eventually switching to IPv6, but that’s so far away from the trend line that we can ignore it for now. That means we can handle addressing (source and destination) and data in the size of a single IP packet and still have room to grow.
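
A quick check that a worst-case tweet, plus addressing, fits a single default-MTU packet. The header size is the standard IPv4 one; the 16-byte ID is the assumption from above:

```python
# A full tweet as a single IPv4 packet, from the limits quoted above.
IP_HEADER = 20        # standard IPv4 header, carries src + dst addresses
TEXT_MAX = 200        # bytes, per Raffi's slides
METADATA_MAX = 512    # bytes, per the Twitter API Wiki
TWEET_ID = 16         # a GUID-sized unique counter (assumption)

packet = IP_HEADER + TEXT_MAX + METADATA_MAX + TWEET_ID
print(packet)                  # 748 bytes
print(packet <= 1500)          # fits a default Ethernet MTU: True
print(2**32 >= 150e6 * 4)      # IPv4 covers 150M users x 4 endpoints: True
```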

So now we’re down to the question that’s been in the back of my mind for years: can we buy Twitter’s core routing and delivery function off the shelf? And if so, how much would it cost, assuming continued network growth? Assuming peaks of 4x average, a 2K/s inbound message baseline (enough to get them through 2011?), and an average fanout of 300 (we’re being super generous here, after all), we’re looking at 2.5 million packets to route per second. If we treat each delivery endpoint as an IP address, again multiply deliveries by endpoints, and assume 4 delivery endpoints per user, we’re looking at a need to provision for 10M deliveries per second.
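
Spelled out, with all inputs being the generous assumptions above:

```python
# Provisioning for growth.
INBOUND_BASELINE = 2000      # tweets/sec, enough to get through 2011?
PEAK_MULTIPLIER = 4          # peak observed at ~4x average
AVG_FANOUT = 300             # deliberately generous
ENDPOINTS_PER_USER = 4       # SMS, web, etc.

routed = INBOUND_BASELINE * PEAK_MULTIPLIER * AVG_FANOUT
deliveries = routed * ENDPOINTS_PER_USER
print(f"{routed:,} packets/sec to route")    # 2,400,000 (~2.5M)
print(f"{deliveries:,} deliveries/sec")      # 9,600,000 (~10M)
```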

Is that a lot? Maybe, but I have reason to think not.

10M 1.5KB packets is ~15GB/second of traffic. Core routers now do terabits of traffic per second (125GB/second), but most of that traffic doesn’t correspond to unique routes. Instead, we need to figure out whether hardware can do either the 2.5M or 10M new “connections” per second that the Twitter workload implies. Cisco’s mid-range 7600 series appears to be able to handle 15M packets per second of raw forwarding. Remember, this is an “internal” network with no advanced L3 or L4 services — just moving packets from one subnet to another as fast as possible, so quoting numbers with all the “real world” stuff turned off is OK.
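
The bandwidth side of that, checked. The worst case charges every delivery a full MTU:

```python
# Sanity-checking traffic volume against core-router capacity.
deliveries_per_sec = 10_000_000
packet_bytes = 1500                   # worst case: a full Ethernet MTU each

traffic = deliveries_per_sec * packet_bytes   # bytes/sec
print(traffic / 1e9)                          # 15.0 GB/sec

terabit_router = 1e12 / 8                     # 1 Tb/s expressed in bytes/sec
print(terabit_router / traffic)               # ~8.3x headroom on raw bandwidth
```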

I’m still not sure that I fully grok the limits of the gear I see for sale, since I’m not a network engineer and most “connections per second” numbers I see appear to be related to VPN and firewall/DPI gear. It looks like the required architecture would have multiple tiers of routing/switching to do things efficiently and not blow out routing tables, but overall it still seems doable to me. This workload is admittedly weird in its composition relative to stateful TCP traffic, and I have no insight into what that might do in off-the-shelf hardware — it might just be the sticky wicket. Knowing that there’s some ambiguity here, I hope someone with more router experience can comment on reducing the Twitter workload to off-the-shelf hardware.

Perhaps the large number of unique and short-lived routes would require extra tiers that might reduce the viability of a hardware solution (if only economically)? ISTM that even if hardware can only keep 2-4M routes in memory at once and can only do a fraction of that in new connections per second, this could still be made to work with semi-intelligent “edge” coalescing and/or MPLS tagging. And based on the time it takes to fetch a word from main memory (including the cache miss) on modern hardware, it seems feasible that tuned hardware should be able to do at least 1M route lookups per second, which puts the current baseline comfortably within hardware’s capabilities and the 2011 growth goals within reach.
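
The latency arithmetic behind that 1M-lookups-per-second hunch. The 100ns miss cost is a round, pessimistic assumption:

```python
# Feasibility floor for route lookups from memory latency alone,
# assuming every lookup is a full cache miss to DRAM (~100ns, a
# made-up round number on the pessimistic side).
MISS_NS = 100
lookups_per_sec = 1e9 / MISS_NS
print(f"{lookups_per_sec:,.0f} lookups/sec")   # 10,000,000

# Even at a tenth of that, 1M lookups/sec covers the ~100K/sec
# baseline with room to spare.
```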

So I’m left back where I started, wondering what’s so hard? Yes, Twitter does a lot besides delivering messages, but all of those things (that I understand and/or know about) have the wonderful behavior that they’re either dealing with the (relatively low) inbound rate of 4K messages/s (max) or that they’re embarrassingly parallel.

So I ask you, lazyweb, what have I missed?

9 Comments

  1. Posted September 12, 2010 at 11:59 pm | Permalink

    (Re-posted from my comment on Louis Gray’s Google Buzz)

    OK, but based on these numbers, what do you expect twitter’s uptime or stability to be? and how far are they from that?

    And, more importantly: what relevance does that have in deciding whether or not to use the service? A quick Google search turns up a Techcrunch article putting their 2007 downtime at 6 days; recent years feel like they’ve been much better.

    I probably wouldn’t want to use it as a 911 replacement, but a couple of days a year doesn’t mean you should stay away from a service (free, at that), if you can get something useful out of it.

  2. Posted September 13, 2010 at 12:50 am | Permalink

    Hey Jorge,

    Uptime? I dunno; what’s the average uptime for a redundant pair of Ciscos based on MTBFs? Let’s say that failover sucks and upgrades take time and cause problems… let’s set our sights low and call it four 9′s for the core function. That’s about an hour of downtime per year. Other systems inside of Twitter (search, web UI, various delivery backends) will probably have dependent downtime larger than that but within the same constraints, since they’re more-or-less parallelizable and some can remain up even if the core function goes down (unless, again, there’s something about Twitter that I don’t grok). ISTM they’re a long way off.

    As for a 911 replacement: one of the most compelling uses for Twitter I’ve ever heard came from my friend Stasha (who’s heavily involved with local disaster-recovery preparedness efforts here in SF). She told me about a friend whose best way, in an emergency, to tell family and friends that they were OK was Twitter. It’s a story that gets to the core of why Twitter is valuable. The relationship asymmetry makes it even more valuable in some situations, and so does the story about APIs and endpoints.

    Regards

  3. Devdas Bhagat
    Posted September 14, 2010 at 11:40 pm | Permalink

    The concurrent session stuff would be interesting if the numbers were for multicast streams.

    Routers do a best match, with about 337181 active routes in the Forwarding Information Base.
    See http://bgp.potaroo.net/as1221/bgp-active.html for the source of that number.

    The routing information is fairly stable, with each packet coming in, getting matched to one route and being pushed out. One packet enters, one packet leaves. No new packets are generated.

    With twitter, the lookups are slightly more complex, where one packet enters and multiple packets leave. That adds some complexity to the lookup.

    New packet generation is significantly more effort in terms of CPU performance.

    With Twitter, you are doing exact match lookups on ~ 150M addresses in software, which is expensive (as opposed to 337K which is pushing at the limits of hardware right now).

  4. Posted September 15, 2010 at 10:33 am | Permalink

    In the design I described, endpoints pre-explode all packets, meaning they don’t have to be multicast. So long as you can route them, you can have the endpoints do the packet generation. Multicast makes the routers do that explosion, but if they can do it to subnets, that might be good enough. I was thinking something similar with MPLS tagging and/or broadcast to an individual subnet to keep the total number of routing operations down (just match the tag, send to the next hop where it’ll pop the tag, and then maybe have receivers filter on local subnet broadcast, which will never be more than the peak inbound rate, which is low). Endpoints might only need to generate one packet per destination subnet in that case, which should lower total core traffic by an order of magnitude.
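
Back of the envelope, the subnet trick looks like this. The /24-sized grouping is a made-up assumption; real savings depend on how followers actually cluster across subnets:

```python
import math

# Worst-case fanout with per-subnet coalescing: one packet per
# destination subnet instead of one per follower.
FANOUT = 6_100_000      # the Lady GaGa case
SUBNET_SIZE = 256       # a /24's worth of endpoints (assumption)

naive = FANOUT                               # one packet per follower
coalesced = math.ceil(FANOUT / SUBNET_SIZE)  # one packet per subnet
print(naive, coalesced)                      # 6100000 vs 23829
```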

    If you were really in an AS-based configuration, you’d be routing on prefix, so the core wouldn’t need the full 150M addresses (which is only about 10 /8′s), just the subnets you’ve decided to group addresses by. That’d increase the hardware requirements to 2 tiers plus endpoints, but that’s still relatively cheap for 150M addresses.

  5. Devdas Bhagat
    Posted September 15, 2010 at 2:17 pm | Permalink

    What bit is the endpoint? The sending client? Then instead of sending one message to Twitter, I send a lot more messages. If the endpoint is basically a host within the Twitter internal system which is accepting my message, exploding it and then sending it to a lot of other endpoints, then the work of multicast is being done by the Twitter system.

    The better analogy here would be email messages and mailing lists with VERP. Send in one message, send out a lot more individual messages to each recipient address.

    That is a slightly harder problem to solve.

    While I agree that good network design does imply intelligent edge nodes and stupid central routing, Twitter doesn’t have that benefit.

    They are more of a telco style model with multicast on top.

  6. Posted September 21, 2010 at 2:27 am | Permalink

    I’m under the impression that message routing hasn’t been their major problem in some time. That seemed to be the case as of Chirp. I admit I didn’t sit through their most recent presentation so perhaps I’m behind.

    Assume for a moment that I’m behind, that tiered message queues can’t do for Twitter what they’ve done for the stock exchanges for many years, and that there is a compelling reason to model application-layer message delivery as routed packets. How do you address durability? TTL on a packet isn’t going to buy you much. Your routers aren’t designed to queue packets for long, much less persist them. What do network partitions look like? E.g., where do the messages go when your routing table is flapping?

    It’s absolutely true that you can model Twitter messages as packets. You could also model them as messages over SMTP, or as messages on a message queue à la Kestrel… etc. It seems like you’re attacking an uninteresting problem, or at least one of the less interesting problems in their space.

    For your next napkin post I’d be curious to see how you would address durability in the face of various failure scenarios. Imagine every component in your system will fail [because they will]. You can be eventually consistent, but you can’t accept even modest data loss. Even more interesting would be how you’d address the read and write stories for delivering to a collection of timelines.

  7. Posted September 21, 2010 at 4:10 am | Permalink

    Hey Matt,

    I agree this problem isn’t interesting, but I find it curious how knowledge gets lost. Reminds me of: http://web.mit.edu/krugman/www/dishpan.html

    Eventual consistency seems good enough for Twitter. We don’t want packet-level TTL for durability; we want ACKs so that senders know to retry (modulo backoff). That’ll double the work for large senders, but Twitter doesn’t guarantee delivery, let alone timely delivery. Given that only routes look to be expensive (not actual traffic), that’s probably fine regardless; we can have longer conversations without hitting our bottleneck for some time. From there, it’s good ole HA: duplicate what you must, over-provision where you can, and for God’s sake don’t ask a DB to answer questions whose answers you can keep in memory (who are my subscribers? who do I subscribe to?). Heck, SSD is cheap and will do 10K+ random read IOPS.
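
A sketch of the sender side, with an invented `send` interface standing in for whatever hands packets to the routing core:

```python
import random
import time

def send_with_retry(send, packet, max_tries=5, base_delay=0.05):
    """Best-effort send with ACKs and exponential backoff.

    `send` is a placeholder callable that returns True on ACK; it is
    not any real Twitter interface. Since delivery isn't guaranteed,
    giving up after a few tries and recovering out of band later is
    acceptable.
    """
    for attempt in range(max_tries):
        if send(packet):
            return True
        # Back off exponentially, with jitter so retries don't synchronize.
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return False  # dropped; listeners can query a DB later
```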

    Front-ends could connect to user-storage servers (the “external” IP for the daemon connected to the message routing system), not “app servers” in the traditional sense (search and the API are special since they rarely write and will want to shard by things other than user). Each user server can manage its own state (tweets, following (inexact), followers (exact), etc.), answer questions via an API, manage its own inbound and outbound queues, and coordinate with one or more fail-over replicas.

    Daemon startup would obviously be read-heavy from a central server, but in normal operation rollouts can be phased, assuming some care is taken with forward-compatibility in the RPC message format (protobufs, thrift, etc.). Inbound message processing will need to look aside to multiple services before eventual delivery, but those lookups are independent. What matters here is that the inbound system keeps state sanely, can flush to some form of replication quickly (network to another DC plus a memory store may be better than disk, but I don’t get the sense that they’re burdened by write rates at the edge), and can walk through delivery incrementally, eventually informing a central DB of the message and its successful delivery.
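
To make the shape concrete, here’s a toy version of that user-storage daemon. Every name in it is invented for illustration; none of this is Twitter’s actual design:

```python
from collections import deque

class UserServer:
    """Toy sketch of a per-user storage daemon: social-graph answers
    kept in memory, inbound/outbound queues managed locally, durable
    writes left to an eventual central DB (not shown)."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.followers = set()     # exact, kept in memory
        self.following = set()     # may be slightly stale ("inexact")
        self.inbound = deque()     # tweets awaiting timeline merge
        self.outbound = deque()    # tweets awaiting fanout

    def post(self, tweet):
        """Queue a tweet and hand the router its delivery targets."""
        self.outbound.append(tweet)
        return list(self.followers)

    def receive(self, tweet):
        """Accept a delivered tweet; flush to replica/DB incrementally."""
        self.inbound.append(tweet)
```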

    As for how to survive everything: that’s not in Twitter’s SLA (obviously), and it may not be in their economic interests. Would depend on OPEX, most likely.

  8. Posted September 21, 2010 at 10:03 am | Permalink

    “In which you talk me into finally getting a Twitter account …” So, you got a Twitter account?

  9. Posted September 22, 2010 at 8:15 pm | Permalink

    Not yet, but Pete’s got my (fake) back: http://twitter.com/FakeAlexRussell