Further info on Identi.ca problems
I posted a few days ago about some technical problems with identi.ca. tl;dr version: our Web servers occasionally hit very high load and stop responding, hurting performance of the site. I've found out a few things in the last few days, so I thought I'd update here for those interested.
High load on a server can have several causes. It can be due to heavy I/O (such as network connections) or to high CPU usage. The underlying problem might be repeated connections to non-responsive network services, or buggy software that works inefficiently or goes into an infinite loop. The load can also be spread over multiple processes, or come from just one process hogging all the resources.
Here's what we're seeing: one Apache process on our Web server shows explosive growth in memory usage. You can see an example in this ps output. Process 26617 has allocated 4GB of memory, which forces the machine to swap to disk and slows the server to a standstill.
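For anyone who wants to spot this kind of process on their own machine, here is a minimal sketch that picks the biggest memory user out of ps-style text. It assumes output shaped like `ps -eo pid,rss,comm` (a header row, then PID, resident size in KiB, and command name). The sample rows are invented for illustration, apart from the PID and rough size mentioned above; they are not our real ps output.

```python
def largest_rss(ps_text):
    """Return (pid, rss_kib, command) for the process with the largest RSS.

    Assumes text shaped like `ps -eo pid,rss,comm`: one header line,
    then one process per line with RSS reported in KiB.
    """
    best = None
    for line in ps_text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed or truncated lines
        pid, rss, comm = int(parts[0]), int(parts[1]), parts[2].strip()
        if best is None or rss > best[1]:
            best = (pid, rss, comm)
    return best


# Invented sample data: one runaway process among ordinary ones.
SAMPLE = """\
  PID     RSS COMMAND
26617 4194304 apache2
26618   48212 apache2
  911   10240 sshd
"""

if __name__ == "__main__":
    print(largest_rss(SAMPLE))
```

On a live box you would feed it the real command output instead of SAMPLE.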
This memory leak is kind of confusing, since we've got a 96MB memory_limit set in our PHP configuration. Theoretically, StatusNet itself shouldn't be able to allocate this much memory.
Another point that seems worth noting is that my systems team already had checks in place for this. If the server is loaded, periodic checks will kill and restart Apache. That means that largely these servers have been recovering on their own. It also means that the problem happens more frequently than I thought -- once or twice an hour, not once a day.
At this point, I'm working on a few fronts. First, I'd like to restrict the amount of memory available to any one process on the system. That should prevent (I think) the issue of one process forcing the server to swap and dragging down everything else with it. I'm trying limits.conf (hasn't worked yet) and may venture into cgroups if that doesn't work out.
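To show what a per-process cap looks like in practice, here is a small sketch using the same kernel mechanism (an rlimit on address space) that the "as" item in limits.conf sets. This is not our production setup: the 512 MiB figure and the helper names are made up for the example, and it assumes Linux. One thing worth keeping in mind is that limits.conf is applied through PAM at session start, so a daemon launched at boot may never pick it up.

```python
import resource
import subprocess
import sys

# Hypothetical cap for the demo: 512 MiB of address space per process.
LIMIT_BYTES = 512 * 1024 * 1024


def cap_memory():
    # Runs in the child just before exec, so only that process is capped.
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))


def run_capped(argv):
    """Run argv under the address-space cap and return its exit code."""
    return subprocess.run(argv, preexec_fn=cap_memory).returncode


# A child that tries to grab 1 GiB and exits with 42 if the kernel refuses.
HOG = (
    "import sys\n"
    "try:\n"
    "    bytearray(1024 ** 3)\n"
    "except MemoryError:\n"
    "    sys.exit(42)\n"
)

if __name__ == "__main__":
    print(run_capped([sys.executable, "-c", HOG]))
```

The point of the design is that the runaway allocation fails inside the offending process instead of pushing the whole machine into swap.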
Second, I've tried to mitigate some of the effects of long-running Apache processes by tuning our Apache settings (including MaxRequestsPerChild, which recycles a process after it has served a set number of requests) to prevent a process from building up a lot of memory over time.
Third, I'm trying to map individual hits to Apache processes so I can determine what exactly is making that process explode. I hope that I can identify what's causing this explosive memory allocation and fix it so it doesn't happen in the first place.
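One way to do that mapping, sketched here under assumptions: Apache's LogFormat supports %P, the ID of the process that served the request, so a log that includes it can be grouped by PID. The format string, the regular expression, and the sample lines below are all hypothetical, not our actual configuration or traffic.

```python
import re
from collections import defaultdict

# Assumed line shape, from a hypothetical LogFormat like: %P %h "%r" %>s
LINE_RE = re.compile(
    r'^(?P<pid>\d+) (?P<host>\S+) "(?P<request>[^"]*)" (?P<status>\d{3})$'
)


def requests_by_pid(lines):
    """Group request lines by the Apache process ID that served them."""
    by_pid = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that don't fit the assumed format
            by_pid[int(match.group("pid"))].append(match.group("request"))
    return dict(by_pid)


# Invented sample lines; the paths and addresses are placeholders.
SAMPLE_LOG = [
    '26617 203.0.113.5 "GET /example/a HTTP/1.1" 200',
    '26618 203.0.113.9 "GET /example/b HTTP/1.1" 200',
    '26617 203.0.113.5 "POST /example/c HTTP/1.1" 500',
]

if __name__ == "__main__":
    for pid, requests in requests_by_pid(SAMPLE_LOG).items():
        print(pid, requests)
```

A caveat with this approach: the access log entry is written when a request finishes, so a request that blows up and gets killed mid-flight may never show up, and the last few hits logged for that PID are only a starting point.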
Thanks to everyone on identi.ca for their patience while I work this out. Still hacking, I promise!