HTTP caching

At some point soon we'll want large installations such as status.net's to be able to reduce their footprint on the main web servers by making more aggressive use of HTTP caching.

The basics
When using a caching reverse proxy we add a second layer of HTTP negotiation in between the client and the ultimate web servers. The ideal is that a large number of requests can be handled directly by the proxy -- which is relatively 'dumb' and works with pre-built responses -- without having to pass them back to the web servers, which have to go through slower PHP interpretation, database queries, etc.

+-----------------------+
|      apache+php       |
+-----------------------+
      ^           |
      |           v
+-----------------------+
| caching reverse proxy |
|   Squid or Varnish    |
+-----------------------+
      ^           |
      |           v
+-----------------------+
|       browsers        |
+-----------------------+

Based on headers on previous responses, a client can do one of three things:
 * use its locally cached copy without going to the server
 * ask the server for a fresh page
 * ask the server for a fresh page *only* if it has changed, based on an If-Modified-Since timestamp or an If-None-Match Etag value

At each level, the server side can respond in one of two main ways:
 * 200 OK -- push out a bunch of new data
 * 304 Not Modified -- tells the client to reuse its locally cached copy, doesn't send any page data
 * (a 304 is only possible if the request included a timestamp or etag to compare against)

The fastest case is to not request something at all. ;)
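To make the 200-vs-304 split concrete, here's a minimal, self-contained sketch of server-side conditional-GET handling. This is not StatusNet code; the timeline is faked as a plain dict so the example can run on its own.

    # Minimal sketch of conditional-GET handling. Not StatusNet code; the timeline
    # is faked as a dict so the example is self-contained.
    def respond(request_headers, timeline):
        """Return (status, headers, body) for a timeline request."""
        etag = '"%d"' % timeline['last_notice_id']
        validators = {'ETag': etag, 'Last-Modified': timeline['last_modified']}

        # If the client sent a matching validator, tell it to reuse its cached
        # copy: 304, no body.
        if (request_headers.get('If-None-Match') == etag or
                request_headers.get('If-Modified-Since') == timeline['last_modified']):
            return 304, validators, b''

        # Otherwise send a full 200 response along with fresh validators.
        body = ('<statuses>... notices up to %d ...</statuses>'
                % timeline['last_notice_id']).encode('utf-8')
        return 200, validators, body

    timeline = {'last_notice_id': 2345,
                'last_modified': 'Mon, 05 Oct 2009 17:00:00 GMT'}
    print(respond({}, timeline)[0])                           # 200: full page
    print(respond({'If-None-Match': '"2345"'}, timeline)[0])  # 304: no page data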

Fun proxy tricks
Things can get more exciting when the application knows about the proxy, because you can set up different caching behavior on the inside than on the outside. This can let the proxy cache things more aggressively, reducing hits to the backend while still letting clients see the latest changes without waiting for an expiration.

For example:
 * 1) client loads "friends timeline" page
 * 2) * proxy doesn't have it, sends it to php backend
 * 3) php backend pushes out the timeline page, with cache info:
 * 4) * proxy should cache it for an hour (s-maxage=3600)
 * 5) * clients should check back every time (max-age=0, must-revalidate)
 * 6) client reloads
 * 7) * proxy serves it straight out of cache [or even as a 304]
 * 8) a new message comes in to that user's inbox...
 * 9) * inbox backend pings the proxy to clear the cache for that page
 * 10) client reloads
 * 11) * proxy fetches new timeline page from backend
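Concretely, the response headers the backend sends at steps 3 through 5 might look something like the sketch below. The directive names are standard Cache-Control; the exact values are just an illustration of the split (proxy caches for an hour, browsers revalidate every time).

    # Sketch of the backend's cache headers for steps 3-5 above (illustrative values).
    # s-maxage applies only to shared caches like the proxy; max-age=0 plus
    # must-revalidate forces browsers to check back on every load.
    timeline_response_headers = {
        'Cache-Control': 's-maxage=3600, max-age=0, must-revalidate',
        'Vary': 'Cookie',  # keep one user's timeline from being served to another
    }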

Since we already have to do similar cache clearing internally for inboxes cached in memcached, this should be an easy extension... the trick is that purges work per-URL, so we're generally limited to clearing the specific URLs we know to hit.
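The "ping the proxy" step usually means an HTTP PURGE request for the exact URL; both Squid (with purge ACLs enabled) and Varnish (with matching VCL) can be set up to honor it. A rough sketch, with placeholder proxy address and hostname:

    # Sketch of purging one URL from a caching proxy configured to accept PURGE
    # requests. The proxy address and Host header are placeholders.
    import http.client

    def purge(path, proxy_host='127.0.0.1', proxy_port=80):
        conn = http.client.HTTPConnection(proxy_host, proxy_port)
        # PURGE is a nonstandard method, but http.client sends it through as-is.
        conn.request('PURGE', path, headers={'Host': 'example.status.net'})
        status = conn.getresponse().status
        conn.close()
        return status  # typically 200 if purged, 404 if the object wasn't cached

    purge('/api/statuses/friends_timeline.xml')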

Tricky things

 * authentication: make sure that proxies aren't serving authenticated pages to anyone else
 * can use 'Vary: Cookie'
 * and with some extra tricks like Wikimedia's Squid patches we can have it pay attention only to the cookies that affect server-side state, ignoring JS-only cookies (see the sketch after this list)
 * API
 * A huge portion of our hits are to the API, which could mean a lot of different folks' data on the same URL. Need to investigate to see the best way to handle these [e.g. can we invalidate one person's friends timeline API hit but not someone else's? do we even need to? will most hits be using relatively unique URLs due to since_id etc.?]
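Here's a rough sketch of the cookie trick mentioned above: compute the proxy's cache key only from cookies that actually change server-side output. The cookie names are placeholders, not StatusNet's real ones, and this just illustrates the idea rather than any particular Squid/Varnish configuration.

    # Sketch of normalizing the Cookie header for cache-key purposes: keep only
    # cookies that affect server-side state, drop JS-only cookies.
    # Cookie names here are placeholders.
    SESSION_COOKIES = {'PHPSESSID', 'statusnet_user'}

    def cache_key_cookies(cookie_header):
        relevant = []
        for part in cookie_header.split(';'):
            name, _, value = part.strip().partition('=')
            if name in SESSION_COOKIES:
                relevant.append('%s=%s' % (name, value))
        return '; '.join(sorted(relevant))

    # Both of these map to the same cache key, so a JS-only preference cookie
    # doesn't fragment the cache:
    print(cache_key_cookies('PHPSESSID=abc123; js_theme=dark'))
    print(cache_key_cookies('js_theme=light; PHPSESSID=abc123'))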

API
The majority of hits to our web servers are API hits. At this time we don't have a good breakdown of hits beyond a few limited bits of information from web logging; the ApiLogging plugin offers some additional info on what parameters are being passed, but we haven't tested it yet.

Planning proxy-level HTTP caching for the API, the kind that could actually keep hits from ever reaching the web servers, runs into some major difficulties...

Hit pattern:

Most well-behaved clients will look like this:
 * 1) hit the API for the timeline
 * 2) after delay, hit the API for the timeline, passing a since_id parameter with the last notice ID we saw on the timeline
 * 3) * if no new notices, continue looping on the same URL
 * 4) once we've received and processed new notices, go back to previous step with updated since_id.
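A minimal sketch of that polling loop (the endpoint is the Twitter-compatible friends timeline; the hostname, delay, and lack of authentication or error handling are simplifications):

    # Sketch of a well-behaved client's polling loop. Simplified: a real client
    # would authenticate, handle errors, and page through results.
    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    API = 'http://example.status.net/api/statuses/friends_timeline.xml'
    since_id = None

    while True:
        url = API if since_id is None else '%s?since_id=%d' % (API, since_id)
        with urllib.request.urlopen(url) as resp:
            statuses = ET.fromstring(resp.read()).findall('status')
        if statuses:
            # process the new notices, then advance the since_id marker
            since_id = max(int(s.findtext('id')) for s in statuses)
        time.sleep(60)  # otherwise keep looping on the same URL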

Most clients _probably_ do not pay attention to Etags or Last-Modified headers and would never benefit from improved support for HTTP caching on the user-agent end.

In theory, this still gives us an opportunity to cache the negative responses at a reverse proxy cache on our end, keeping some hits from having to reach the web servers:


 * 1) statuses/friends_timeline.xml
 * 2) * cache MISS -> web server. Last notice in the timeline is number 2345.
 * 3) statuses/friends_timeline.xml?since_id=2345
 * 4) * cache MISS -> web server. No notices in response; it gets saved into proxy cache.
 * 5) statuses/friends_timeline.xml?since_id=2345
 * 6) * cache HIT (response with no notices)
 * 7) statuses/friends_timeline.xml?since_id=2345
 * 8) * cache HIT (response with no notices)
 * 9) (a new message comes in)
 * 10) * StatusNet needs to tell the proxy server to PURGE the URL statuses/friends_timeline.xml?since_id=2345 from its cache!
 * 11) statuses/friends_timeline.xml?since_id=2345
 * 12) * cache MISS -> web server. Notices are included in the response; it gets saved into proxy cache.
 * 13) statuses/friends_timeline.xml?since_id=2346
 * 14) * cache MISS -> web server. No notices in response; it gets saved into proxy cache. (and we repeat the cycle...)

There is a problem, though, which is that the...

URL space is open-ended:
 * statuses/friends_timeline.xml
 * statuses/friends_timeline.xml?since_id=2345
 * statuses/friends_timeline.xml?since_id=2346...
 * statuses/friends_timeline.xml?count=20
 * statuses/friends_timeline.xml?count=20&since_id=2345
 * statuses/friends_timeline.xml?count=20&since_id=2346...
 * statuses/friends_timeline.xml?user_id=1234
 * statuses/friends_timeline.xml?user_id=1234&since_id=2345
 * statuses/friends_timeline.xml?user_id=1234&since_id=2346...
 * statuses/friends_timeline/1234.xml
 * statuses/friends_timeline/1234.xml?since_id=2345
 * statuses/friends_timeline/1234.xml?since_id=2346...
 * statuses/friends_timeline/1234.xml?count=20
 * statuses/friends_timeline/1234.xml?count=20&since_id=2345
 * statuses/friends_timeline/1234.xml?count=20&since_id=2346...

Even for the same timeline, there are a number of different ways it can be called. Different clients might use different count parameters, add other parameters, might explicitly use the user_id, or might be calling with an older since_id than is current.

Can we reliably know which URLs to purge from the proxy cache?

We could perhaps limit caching to some common cases with canonical URLs:


 * statuses/friends_timeline.(xml|json|atom)?since_id=()

Then when that timeline changes, we only need to purge the canonical form with a since_id equal to what was previously the newest notice in the timeline.
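A sketch of what the purge bookkeeping could look like under that scheme: when a new notice lands in a timeline, build the list of canonical URLs keyed on the previous newest notice id and send each one a PURGE (as in the earlier purge sketch). The /api/... path layout is an assumption for illustration.

    # Sketch: when a new notice arrives and previous_top_id was the newest notice
    # before it, these are the canonical URLs we'd ask the proxy to purge.
    def urls_to_purge(previous_top_id):
        urls = []
        for fmt in ('xml', 'json', 'atom'):
            urls.append('/api/statuses/friends_timeline.%s' % fmt)
            urls.append('/api/statuses/friends_timeline.%s?since_id=%d'
                        % (fmt, previous_top_id))
        return urls

    # e.g. notice 2346 arrives while 2345 was previously the newest:
    print(urls_to_purge(2345))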

Problem:
 * any client passing a count parameter will not benefit from the proxy caching
 * any client using another not-quite-canonical URL will not benefit from the proxy caching

Variances

Since the default canonical URLs for most timelines aren't unique to the user, we'd need to make sure any proxy caching avoids sending other people your output...

Authentication on the API can be based on either:
 * cookie auth (sorta, sometimes, vaguely)
 * Authorization header
 * with Basic auth: value remains the same as long as username/password don't change
 * with OAuth: value changes on every hit (ever-changing timestamp and nonce in OAuth signature)

A 'Vary: Cookie, Authorization' kinda header might help keep these things distinct for basic auth, but would prevent OAuth accesses from ever being cache hits.

Adding some kind of specialized header munging to the caching proxy system so it lets OAuth hits be cached could work; it would defeat the replay-attack protection from OAuth but this wouldn't be terribly exciting on empty timelines. ;)
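One way to picture that munging (just a sketch of the idea, not a feature any of our proxies have today): reduce the OAuth Authorization header to its stable parts, the consumer key and token, before computing the cache key, so repeated hits from the same client collapse onto one cache entry despite the ever-changing nonce, timestamp, and signature.

    # Sketch of the header-munging idea: keep only the stable OAuth parameters
    # (consumer key and token) as the cache-key component, dropping the per-request
    # nonce, timestamp, and signature. Illustration only, not an existing proxy feature.
    import re

    STABLE_PARAMS = ('oauth_consumer_key', 'oauth_token')

    def oauth_cache_key(authorization_header):
        params = dict(re.findall(r'(oauth_\w+)="([^"]*)"', authorization_header))
        return '&'.join('%s=%s' % (k, params.get(k, '')) for k in STABLE_PARAMS)

    hdr1 = ('OAuth oauth_consumer_key="abc", oauth_token="xyz", '
            'oauth_nonce="n1", oauth_timestamp="1254000000", oauth_signature="s1"')
    hdr2 = ('OAuth oauth_consumer_key="abc", oauth_token="xyz", '
            'oauth_nonce="n2", oauth_timestamp="1254000060", oauth_signature="s2"')
    print(oauth_cache_key(hdr1) == oauth_cache_key(hdr2))  # True: one cache entry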