Bulk delivery issues

The Ashton Kutcher problem
Most people in a social network have only a few dozen or a few hundred direct followers. The most popular 'regular' people might have a couple thousand fans and casual acquaintances. But of course, whenever you give something to people on the Internet, it's going to go places you didn't expect!

Actor Ashton Kutcher was famously the first to gain one million followers on Twitter. For platforms like StatusNet, which concentrate on combining many smaller sites into a federated network, this is an interesting case to consider: a different scale of relationships than we originally expected can produce very different performance characteristics.

Common case
Our current generation of processing infrastructure is mostly designed for the 'outgoing' case:


 * One of our users sends a message.
 * We do a little bit of work immediately from the web server to save the message, then let the user get back to what they were doing.
 * Most of the work gets spread into a few background queue events, which get run as soon as they're able:
    * save to local inboxes, send email notifications
    * send text messages to any subscribers requesting them
    * send XMPP chat messages to any subscribers requesting them
    * send realtime web updates for anybody watching the site in a web browser
    * mirror the message out to other services like Twitter and Facebook

At any given time we only have to deal with the number of messages actually being sent or received by our users, which isn't too insane. One message usually means just a handful of queue events to process; if there are a lot of followers we might send out a few hundred notifications by email, SMS, chat, or pings to remote sites, and that work can be spread over multiple servers to get through it as fast as possible.
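
As a rough Python sketch of that fan-out (the queue names and the tiny in-memory queue manager here are illustrative stand-ins, not StatusNet's actual interfaces):

    from collections import defaultdict, deque

    class QueueManager:
        """Tiny in-memory stand-in for the real queue server (illustrative)."""
        def __init__(self):
            self.queues = defaultdict(deque)

        def enqueue(self, queue_name, item):
            self.queues[queue_name].append(item)

    def enqueue_notice_jobs(qm, notice_id):
        # One saved notice fans out into one background job per delivery
        # channel from the list above.
        for queue_name in ('inbox', 'sms', 'xmpp', 'public', 'mirror'):
            qm.enqueue(queue_name, notice_id)

    qm = QueueManager()
    enqueue_notice_jobs(qm, 12345)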

Meet our Ashton
Currently the worst-case scenario in StatusNet's system combines our OStatus site-to-site federation with a large number of small subscribing sites all hosted in the same place: announcements from the cloud@status.net address.


 * We publish a maintenance announcement from cloud@status.net.
 * cloud.status.net's PuSH hub pumps out a separate POST to each of 19,090 individual hosted *.status.net sites.
 * Each of those sites schedules its own queue processing:
    * process the incoming PuSH feed into a notice and save it locally
    * save to local inboxes, send email notifications
    * send text messages to any subscribers requesting them
    * send XMPP chat messages to any subscribers requesting them
    * send realtime web updates for anybody watching the site in a web browser

Now our single outgoing message balloons into 114,540 individual events (one incoming POST plus five queue jobs for each of the 19,090 sites), each of which needs to be processed ASAP... preferably without disrupting everything else. In production we've seen these cause delays and disruption for anywhere from 30 minutes to a couple of hours before everything clears up.
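
Spelling out the arithmetic from the list above:

    # One incoming POST plus five local queue jobs for each hosted site:
    sites = 19090
    events_per_site = 1 + 5
    print(sites * events_per_site)   # 114540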

What slows things down
The first problem is simply that our federation protocol is optimized for separately-hosted sites. When we send a message to 20k recipients within a single site, we have a lot less work to do! If we send to 20k truly separate sites, it'll still take a while to reach them all, but we only have to send those single notifications; the rest of the work gets done on the receiving sites, spreading the load over thousands of different servers.

Here, our federation actually works against us: we need to do all the work of remote sending and receiving, but have to do it all in the same place.

Test system
After getting some live data on a couple of our updates, I'm doing some experimentation on a local test system with 1000 destination sites and 8 background processing threads. (In production we're running 64 processing threads over a couple of app servers, with the main load shared over several web and database servers, and we have 19x as many destination sites.)

Delivery times run about 9 minutes to completion in the best cases with minimal tuning -- much longer when things go badly -- but I'm hoping to knock that down still further and eliminate the latency disruption to other jobs.

CPU time during delivery is relatively small; there's a lot of time spent idle or waiting on I/O. Everything goes through queues, memcached, and MySQL.

Repeated work
The first problem shows up when we're scheduling those outgoing POSTs: the job that enqueues all the outgoing items often dies partway through. The result is that the same job gets re-run... from the beginning. We end up sending the same message to the same servers several times, making extra work!

The failures can also cause temporary connection loss to the queue server, which takes the processing thread out of commission until it reconnects.

I'm trying to improve this by splitting up the enqueueing into smaller pieces which can complete more reliably and cause less duplication if they do fail.
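
Here's a Python sketch of that idea, assuming a generic enqueue callable; the chunk size and the 'hubprep' queue name are made up for illustration:

    CHUNK_SIZE = 100

    def enqueue_in_chunks(enqueue, destinations):
        """Split one giant enqueueing job into small, independently retryable chunks."""
        for start in range(0, len(destinations), CHUNK_SIZE):
            chunk = destinations[start:start + CHUNK_SIZE]
            # If the job handling this chunk dies and gets re-run, at most
            # CHUNK_SIZE duplicate sends happen instead of the full 19k.
            enqueue('hubprep', chunk)

    # Trivial stand-in for the queue server:
    jobs = []
    enqueue_in_chunks(lambda queue, item: jobs.append((queue, item)),
                      ['site%d.status.net' % i for i in range(1000)])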

Just between you and me
There's also some work that just plain doesn't need to be done; since the hub and these subscribers are running on the same servers, we can skip a couple of steps in the site-to-site distribution.

We've already eliminated the need for an HTTP hit from the publisher to the hub when there are no subscribers; we could also drop the HTTP hit between the hub and the subscriber when we know we are the subscriber! Instead of running the delivery through the web server, we could enqueue the PuSH input item directly onto the other site's queue, skipping a couple of context switches and lightening the load on the web servers.
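
A sketch of that short-circuit, with hypothetical helpers (is_local_site, enqueue_for_site, http_post) standing in for the real plumbing:

    def deliver_push(subscriber_url, feed_body,
                     is_local_site, enqueue_for_site, http_post):
        if is_local_site(subscriber_url):
            # The subscriber lives on our own servers: drop the raw feed
            # straight into that site's 'pushin' queue, skipping the
            # hub-to-subscriber HTTP POST and the web server entirely.
            enqueue_for_site(subscriber_url, 'pushin', feed_body)
        else:
            # Genuinely remote subscriber: normal PuSH HTTP POST.
            http_post(subscriber_url, feed_body)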

I haven't yet tested how much time this would save, but it should be a big win on the outgoing side.

Context-switching
There's overhead to our background processing: it takes some time to push jobs in and out of the queues, and when we handle jobs for multiple sites we have to reload configurations and reestablish database connections every time we switch. While it only takes a few milliseconds to change sites, that adds up: just 10ms per item blows up to about 3 minutes of wasted processing time at this scale (10ms across ~19,000 sites is about 190 seconds).

Now, we have to do at least one context switch per site in order to process the incoming message. But if we only have to do it once instead of five times, that could be a big win on the incoming side.
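
One way to pay that cost once per site is to group queued items by destination site before handling them. A sketch, with switch_to_site and handle as hypothetical hooks:

    from itertools import groupby
    from operator import itemgetter

    def process_grouped(items, switch_to_site, handle):
        """items: (site, job) pairs pulled off the queue."""
        for site, site_jobs in groupby(sorted(items, key=itemgetter(0)),
                                       key=itemgetter(0)):
            switch_to_site(site)    # pay the ~10ms config/DB switch once...
            for _, job in site_jobs:
                handle(job)         # ...then run all of this site's jobs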

Null work
Right now there's a fair amount of null work happening: time we spend processing a case that doesn't happen.

Most of the sites we're pushing the message to are single-user sites and won't trigger an email, SMS, or XMPP notification on the incoming message, so time we spend preparing notifications is wasted. Most of the sites aren't going to be open in someone's web browser at any given time, so time we spend preparing a real-time web update is wasted too.

Eliminating jobs that aren't actually going to be needed before we queue them up is probably the best way to cut down on the context-switching overhead.
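
A sketch of that filtering at enqueue time, using hypothetical per-site subscriber counts:

    def enqueue_incoming_jobs(enqueue, site, notice_id):
        # Inbox delivery always happens; the other channels only get queued
        # when some subscriber (or open browser session) will actually see them.
        enqueue('inbox', notice_id)
        if site.email_subscribers:
            enqueue('email', notice_id)
        if site.sms_subscribers:
            enqueue('sms', notice_id)
        if site.xmpp_subscribers:
            enqueue('xmpp', notice_id)
        if site.active_browser_sessions:
            enqueue('public', notice_id)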

Latency and disruption
The other problem is that while all these jobs are running, they flood the same queues we need for everything else. If it takes 30 minutes for things to settle down, that's 30 minutes during which all our other activity is delayed too. We've seen disruption to Facebook and Twitter mirroring, XMPP chat and real-time web updates, and other background jobs while bulk announcements were being processed... everything works, but new jobs might not get processed for 30 minutes or more -- ouch!

The latencies aren't totally even right now, since several stages are running at the same time -- some incoming messages are being processed while we're still enqueueing outgoing ones -- but the sheer number of things pounding into the queues means that any new message has to fight for processing time with hundreds of copies of the same announcement.

Ideally, we want everything to run smoothly; new messages going out or coming in with smaller audiences should keep going through right away, while our giant mass announcement can probably afford to take a few extra minutes to reach some people.

We can spread things out a bit by breaking the 'hubout' and 'pushin' jobs out to their own queues. This'll let other event types keep running in between the low-level OStatus PubSubHubbub activity -- but it would leave any new OStatus-based messages still waiting. Disrupting our entire federation system isn't OK!
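
A minimal sketch of that routing; the queue names match the ones above, but the routing function itself is illustrative:

    BULK_QUEUES = {'hubout', 'pushin'}

    def queue_for(job_type):
        # Bulk PubSubHubbub traffic gets its own dedicated queues so the
        # general-purpose queue keeps flowing; everything else is unchanged.
        return job_type if job_type in BULK_QUEUES else 'main'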

So far it looks like the winner here may be to divide up the enqueueing job for the hub output differently. Currently we queue up output to all destination sites as quickly as we can, then let the queueing system run through them at whatever pace it can manage; any new outgoing messages end up at the end of the queue, waiting for all 19k updates to go out first.

Instead, we can enqueue, say, the first 50 sites and then a job telling us to enqueue the next batch. Once we've pushed out to the first 50 (or at least tried to), we read the batch item and enqueue the next 50 sites, plus the next batch marker. Any new outgoing messages that arrive while this is still running get interleaved in, and we end up with at least a little rate limiting on the outgoing side.
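
As a sketch -- the batch size of 50 is from above, but the 'hubbatch' continuation queue name is made up:

    BATCH_SIZE = 50

    def hub_out_batch(enqueue, destinations, offset=0):
        """Send one batch of outgoing POST jobs, then queue a marker for the next."""
        for site in destinations[offset:offset + BATCH_SIZE]:
            enqueue('hubout', site)        # one outgoing POST job per site
        next_offset = offset + BATCH_SIZE
        if next_offset < len(destinations):
            # By the time the queue reaches this marker, any newly posted
            # messages have had a chance to interleave ahead of the next batch.
            enqueue('hubbatch', next_offset)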