Overview of technical problems with Identi.ca

Evan Prodromou

(UPDATE: see Further info on Identi.ca problems for more info.)

We've been having some technical issues with identi.ca for the last few weeks, and there's been a lot of speculation about what those problems are. I'd like to outline what the problem is and let people know what I'm doing to solve it.

The Identi.ca site has a very traditional LAMP layout. There is a load balancer that distributes HTTP and HTTPS hits over 7 web servers. These servers use a pair of back-end database servers -- one master, for database writes, and a slave, which replicates from the master, for reads. (Actually, reads are unevenly split across master and slave, since there are a lot more reads than writes.)

We off-load a lot of tasks to off-line processing app servers. Web servers push tasks like distributing data over OStatus off to a queue system, since they don't have to be done "at web time". There are PHP-based daemon processes that read new tasks from an Apache ActiveMQ server, and process them.
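As a rough illustration of the off-loading pattern described above, here is a minimal Python sketch. It is not the Identi.ca code, which is PHP and uses Apache ActiveMQ; an in-process `queue.Queue` stands in for the message broker, and the names `handle_web_request` and `run_daemon_once` are made up for the example.

```python
import queue

# Stand-in for the ActiveMQ broker (assumption: a simple in-process queue).
tasks = queue.Queue()

def handle_web_request(notice_id):
    """Web-time work: enqueue the slow task and return right away."""
    tasks.put({"type": "distribute", "notice_id": notice_id})
    return "201 Created"

def run_daemon_once(processed):
    """Daemon work: drain whatever tasks are currently queued."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break
        # A real daemon would push the notice out over OStatus here.
        processed.append(task["notice_id"])
        tasks.task_done()

status = handle_web_request(42)
done = []
run_daemon_once(done)
```

The point of the design is that the Web request returns before the distribution work has happened; the daemon picks it up later.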

Finally, we use memcached pretty extensively throughout the codebase. As much as possible, our database hits are checked with memcached first before we check with the database. This gives us really great performance... usually. However, there's also a downside to it.
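The "check the cache first, then the database" approach can be sketched roughly as follows. This is an illustration only: a plain dictionary stands in for memcached, and `CacheFirstReader`, `fake_db`, and the key format are hypothetical names, not anything from the StatusNet codebase.

```python
from typing import Callable, Dict, Optional

class CacheFirstReader:
    """Look in the cache first; go to the database only on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]):
        self._cache: Dict[str, str] = {}  # stand-in for memcached
        self._load_from_db = load_from_db
        self.db_hits = 0                  # how often we had to hit the database

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]
        # Cache miss: pay for a database read, then remember the result.
        self.db_hits += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value
        return value

fake_db = {"notice:1": "hello world"}  # hypothetical data
reader = CacheFirstReader(fake_db.get)
first = reader.get("notice:1")   # miss: goes to the database
second = reader.get("notice:1")  # hit: served from the cache
```

When the cache is unreachable, every lookup behaves like the first call, which is why a cache failure turns into extra database load.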

Our memcached array is spread across our app and web servers. This means that, for N servers, 1/N of memcached requests will be local requests -- which we hoped would be a good design. Unfortunately, it's also very fragile. If any one of our app or web servers goes down, 1/N of memcached requests will fail -- which means more database requests, which slows down the whole site.
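To make the 1/N figure concrete, here is a small Python sketch. It assumes keys are assigned to servers by simple hash-modulo, which may not match what the actual memcached client does, and it assumes 8 servers with made-up names; the post only says there are 7 web servers plus some app servers.

```python
import hashlib
from typing import Iterable, List, Set

def node_for_key(key: str, nodes: List[str]) -> str:
    """Pick a server for a key by hash-modulo (an assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def failed_fraction(keys: Iterable[str], nodes: List[str], down: Set[str]) -> float:
    """Fraction of keys whose server is currently down."""
    keys = list(keys)
    lost = sum(1 for k in keys if node_for_key(k, nodes) in down)
    return lost / len(keys)

nodes = ["server%d" % i for i in range(8)]        # hypothetical server names
keys = ["notice:%d" % i for i in range(80000)]    # hypothetical cache keys
fraction = failed_fraction(keys, nodes, down={"server3"})
```

With one of 8 servers down, `fraction` comes out close to 1/8, and each of those failed lookups falls through to the database.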

And, unfortunately, we've been having a problem just like this. About once per day, one of our Web servers hits very high loads and stops responding to Web requests. This would normally not be such a big deal, but the server also stops responding to memcached requests, which slows down other requests significantly. This results in slower loads for all users.

It seems like the most probable suspect in the mysterious shutdowns of servers is a Web hit that's causing an infinite loop or a connection to an unusable server. There's not a clear culprit, however. Diagnosis has been hard since we haven't seen the problem on other servers or in the development environment. Its relative rarity (once a day means once every million or so hits) would suggest some obscure functionality, but there's not a clear sign of what it is.

I'm spending some time trying to do more specific logging and tracking down the problem, but until that happens I have to restart the errant server by hand (well, by APC...). That can sometimes take a while, especially in the middle of the night North America time.

I appreciate people's patience while we track this down.

Comments

detecting the bug

one way to detect the bug is using strace on the process

i usually use htop, locate the process, and then use the s key to strace it

Thoughts about load balancing

I don't know what you are using for your load balancer, but in my professional career, I have found that NginX doesn't cut the mustard. Rather, HAProxy does a much better job at being a performant load balancer. We went from having consistent and regular outages in production, to 99.95% uptime after switching to HAProxy. Might be worth the look.

Also, we use dbshards on our SQL servers to get performance and reliability out of production. I believe MySQL supports sharding natively.

I suppose it's the wrong

I suppose it's the wrong response, but until you find a way to detect the spin condition that requires the restart (or rearchitect the memcache layer) is there any particular reason why the app servers can't be restarted automatically if they fail to check-in for 2 minutes or somesuch?
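The automatic-restart idea in the comment above could be sketched like this in Python. It is only an illustration of the check-in logic: the `Watchdog` class and server names are hypothetical, and the actual restart (for example, power-cycling the machine) is left out because the post does not describe how that would be scripted.

```python
import time
from typing import Dict, List, Optional

CHECK_IN_TIMEOUT = 120.0  # seconds; the "2 minutes or somesuch" from the comment

class Watchdog:
    """Track the last check-in per server and report the overdue ones."""

    def __init__(self, timeout: float = CHECK_IN_TIMEOUT):
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def check_in(self, server: str, now: Optional[float] = None) -> None:
        self.last_seen[server] = time.time() if now is None else now

    def overdue(self, now: Optional[float] = None) -> List[str]:
        """Servers that have been silent for longer than the timeout."""
        now = time.time() if now is None else now
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout)

dog = Watchdog()
dog.check_in("web1", now=0.0)     # hypothetical server names and times
dog.check_in("web2", now=100.0)
stale = dog.overdue(now=130.0)    # web1 has been silent for 130 seconds
```

A supervising process would call `overdue()` periodically and trigger a restart for each server it returns.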

Thanks Evan

All I have to do here is post a comment to thank you for all this effort.

I don't know you well (only from here, the internet), but the way you tell us what is going on with identi.ca, our favorite microblog platform, is very transparent and direct.

I know what I'm doing now is not new, but I think we sometimes forget all the effort behind something we have learned to use and can't live without.

That's all and thank you again from Tierra del Fuego, Argentina
