Overview of technical problems with Identi.ca
(UPDATE: see Further info on Identi.ca problems for more info.)
We've been having some technical issues with Identi.ca for the last few weeks, and there's been a lot of speculation about what those problems are. I'd like to outline the problem and let people know what I'm doing to solve it.
The Identi.ca site has a very traditional LAMP layout. There is a load balancer that distributes HTTP and HTTPS hits over 7 web servers. These servers use a pair of back-end database servers -- one master, for database writes, and a slave, which replicates from the master, for reads. (Actually, reads are unevenly split across master and slave, since there are a lot more reads than writes.)
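To give a rough idea of what that read/write split looks like in application code, here's a minimal sketch. The hostnames, credentials, table names, and the helper function are placeholders for illustration, not our actual code.

```php
<?php
// Hypothetical sketch of a master/slave read-write split.
// Hostnames, credentials, and the 20% figure are placeholders.
function getDbConnection($forWrite = false)
{
    // Writes always go to the master; reads mostly go to the slave,
    // with a share sent to the master to balance the load.
    if ($forWrite || mt_rand(1, 100) <= 20) {
        return new PDO('mysql:host=db-master;dbname=statusnet', 'user', 'pass');
    }
    return new PDO('mysql:host=db-slave;dbname=statusnet', 'user', 'pass');
}

// Reads usually come from the slave...
$db = getDbConnection();
$stmt = $db->query("SELECT content FROM notice ORDER BY created DESC LIMIT 20");

// ...while writes always hit the master and replicate out to the slave.
$db = getDbConnection(true);
$db->prepare("INSERT INTO notice (profile_id, content) VALUES (?, ?)")
   ->execute(array(1, 'Hello, fediverse!'));
```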
We off-load a lot of tasks to off-line processing app servers. Web servers push tasks like distributing data over OStatus off to a queue system, since they don't have to be done "at web time". There are PHP-based daemon processes that read new tasks from an Apache ActiveMQ server and process them.
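As a sketch of what that hand-off looks like, here's the general shape of a producer and consumer talking to ActiveMQ over STOMP (using the PECL Stomp extension). The broker address, queue name, and payload format are made up for illustration.

```php
<?php
// Rough sketch of pushing a background job onto ActiveMQ over STOMP.
// Broker address, queue name, and payload shape are illustrative only.
$stomp = new Stomp('tcp://queue-server:61613');

// The web process just enqueues the work and returns immediately...
$job = json_encode(array('type' => 'ostatus_distribute', 'notice_id' => 12345));
$stomp->send('/queue/statusnet-outgoing', $job, array('persistent' => 'true'));

// ...while a long-running PHP daemon (a separate process in reality)
// consumes the queue and does the slow work off-line.
$stomp->subscribe('/queue/statusnet-outgoing');
while ($frame = $stomp->readFrame()) {
    $task = json_decode($frame->body, true);
    // ... do the slow work (e.g. OStatus distribution) here ...
    $stomp->ack($frame);
}
```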
Finally, we use memcached pretty extensively throughout the codebase. As much as possible, we check memcached for data before we hit the database. This gives us really great performance... usually. However, there's also a downside to it.
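This is the classic cache-aside pattern. Here's a minimal sketch of it; the key name, lookup function, and TTL are hypothetical, not lifted from the actual codebase.

```php
<?php
// Sketch of the cache-aside pattern: check memcached first, fall back
// to the database on a miss, then populate the cache for next time.
$mc = new Memcached();
$mc->addServer('localhost', 11211);

function getNotice(Memcached $mc, PDO $db, $id)
{
    $key = 'notice:' . $id;

    // Check memcached first...
    $notice = $mc->get($key);
    if ($notice !== false) {
        return $notice; // cache hit: no database work at all
    }

    // ...and only fall back to the database on a miss.
    $stmt = $db->prepare("SELECT * FROM notice WHERE id = ?");
    $stmt->execute(array($id));
    $notice = $stmt->fetch(PDO::FETCH_ASSOC);

    // Store the result so the next request is a cache hit.
    $mc->set($key, $notice, 300);
    return $notice;
}
```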
Our memcached array is spread across our app and web servers. This means that, for N servers, 1/N of memcached requests will be local requests -- which we hoped would be a good design. Unfortunately, it's also very fragile. If any of our app or web servers goes down, 1/N of memcached requests will fail -- which means more database requests, which slows down the whole site.
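For a sense of why one dead node hurts everyone, here's a sketch of how a distributed pool like that is configured. The hostnames and option values are illustrative, not our real settings.

```php
<?php
// Sketch of a distributed memcached pool: keys are hashed across all
// N web/app servers, so each node holds roughly 1/N of the cache.
$mc = new Memcached();
$mc->addServers(array(
    array('web1.example.net', 11211),
    array('web2.example.net', 11211),
    array('web3.example.net', 11211),
    // ... one entry per web/app server ...
));

// Consistent hashing keeps most keys on the same node as the pool changes.
$mc->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, true);

// If one node stops answering, every get() for a key that hashes to it
// fails (or waits on a timeout), and that share of lookups falls through
// to the database instead.
$mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, 100); // milliseconds
```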
And, unfortunately, we've been having a problem just like this. About once per day, one of our web servers hits very high load and stops responding to web requests. That would normally not be such a big deal, but the server also stops responding to memcached requests, which slows down requests on every other server as well. The result is slower page loads for all users.
It seems like the most probable suspect in these mysterious server lockups is a web hit that causes an infinite loop or a connection to an unresponsive server. There's no clear culprit, however. Diagnosis has been hard, since we haven't seen the problem on other servers or in our development environment. Its relative rarity (once a day means once every million or so hits) suggests some obscure piece of functionality, but there's no clear sign of what it is.
I'm spending some time adding more specific logging and tracking down the problem, but until that's done I have to restart the errant server by hand (well, by APC...). That can sometimes take a while, especially in the middle of the night, North America time.
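For the curious, the kind of logging I mean is along these lines: record any request that takes suspiciously long, so we can see which URLs are involved when a server locks up. The log path and threshold below are placeholders, not what we actually run.

```php
<?php
// Rough sketch of per-request slow-hit logging. The threshold and log
// path are placeholders; this is not the actual instrumentation.
$requestStart = microtime(true);

register_shutdown_function(function () use ($requestStart) {
    $elapsed = microtime(true) - $requestStart;
    if ($elapsed > 5.0) { // anything this slow is worth a closer look
        error_log(sprintf(
            "SLOW %0.1fs %s %s\n",
            $elapsed,
            $_SERVER['REQUEST_METHOD'],
            $_SERVER['REQUEST_URI']
        ), 3, '/var/log/statusnet/slow-requests.log');
    }
});
```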
I appreciate people's patience while we track this down.