AO3 News

Published: 2015-11-04 22:53:27 UTC

At approximately 23:00 UTC on October 24, the Archive of Our Own began experiencing significant slowdowns. We suspected these slowdowns were being caused by the database, but four days of sleuthing led by volunteer SysAdmin james_ revealed a different culprit: the server hosting all our tag feeds, Archive skins, and work downloads. While a permanent solution is still to come, we were able to put a temporary fix in place and restore the Archive to full service at 21:00 UTC on October 29.

Read on for the full details of the investigation and our plans to avoid a repeat in the future.

Incident Summary

On October 24, we started to see very strange load graphs on our firewalls, and reports started coming in via Twitter that a significant number of users were getting 503 errors. There had been sporadic reports of issues earlier in the week as well, but we attributed these to the fact that one of our two front-end web servers had developed a hardware issue and had to be returned to its supplier for repair. (That server was returned to us and put back into service today, November 4.)

Over the next few days, we logged tickets with our MySQL database support vendor and tried adjusting the configuration of various parts of the system to handle the large spikes in load we were seeing. However, we were still unable to identify the cause of the issue.

We gradually identified a cycle in the spikes and began turning off, one by one, the internal loads that were periodic in nature (e.g., hourly database updates). Unfortunately, this did not reveal the problem either.

On the 29th of October, james_ logged in to a virtual machine that runs on one of our servers and noticed that it felt sluggish. We then ran a small disc performance check, which showed severely degraded performance on this server. At this point, we realised that our assumption that the application was being delayed by a database problem was wrong. Instead, our web server was being held up by slow performance from the NAS, another virtual machine running on the same server.
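
For illustration, a quick sequential-write check along these lines can expose that kind of degradation. The sketch below is a minimal Python example with placeholder paths and sizes; it is not the actual tool we ran, which isn't named in this post.

    import os
    import time

    # Placeholder path and sizes; adjust for the disc under test.
    TEST_FILE = "/tmp/disc_check.tmp"
    BLOCK = b"\0" * (1024 * 1024)       # 1 MiB per write
    BLOCKS = 256                        # 256 MiB total

    start = time.monotonic()
    with open(TEST_FILE, "wb") as f:
        for _ in range(BLOCKS):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())            # force the data to disc, not just the page cache
    elapsed = time.monotonic() - start

    os.remove(TEST_FILE)
    print(f"Wrote {BLOCKS} MiB in {elapsed:.2f}s ({BLOCKS / elapsed:.1f} MiB/s)")

A healthy local disc will report throughput orders of magnitude higher than a struggling one, which is the kind of gap that pointed us away from the database and towards the NAS.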

The NAS holds a relatively small number of static files, including skin assets (such as background images), tag feeds, and work downloads (PDF, .epub, and .mobi files of works). Most page requests made to the Archive load some of these files, which are normally delivered very quickly. But because the system serving those assets was having problems, requests backed up until a cascade of them timed out, causing the spikes and temporarily clearing the backlog.

To fix the issue, we had to take the NAS out of service. We immediately copied the skin assets to local disc instead, and put up a banner warning users that tag feeds, and potentially downloads, would need to be disabled. After tag feeds were disabled, the service became more stable, but there were further spikes. These turned out to be caused by our configuration management system erroneously returning the NAS to service after we had disabled it.

After a brief discussion, AD&T and Systems agreed to temporarily move the shared drive to one of the front-end servers. However, this shared drive is a single point of failure, which is undesirable, so we also agreed to reconfigure the Archive to remove it within a few months.

Once the feeds and downloads were moved to the front-end server, the system became stable, and full functionality returned at 21:00 UTC on October 29.

We still do not know the cause of the slowdown on the NAS. Because it is a virtual machine, our best guess is that the problem is with a broken disc on the underlying hardware, but we do not have access to the server itself to test. We do have a ticket open with the virtual machine server vendor.

The site was significantly affected for 118 hours. Our analytics platform shows a drop of about 8% in page views for the duration of the issue, as well as significantly longer page delivery times, meaning we were providing a reduced service for the whole period.

Lessons Learnt

  • Any single point of failure that remains in our architecture must be removed through redesign.
  • We need enough spare capacity in our servers that, in the case of a hardware failure, we can pull a server from a different function and have it perform adequately in its new role. For instance, we had not moved a system from its role as a resque worker to being a front-end machine because of worries about its lack of SSDs. We have now ordered SSD upgrades for the two worker systems so that this worry should not recur. Database servers are more problematic because of their cost; however, when the current systems are replaced, the old systems will become app servers and could be returned to their old role in an emergency, at reduced capacity.
  • We lack centralised logging for our servers; having it would have shortened the time to diagnosis. (A minimal sketch of shipping logs to a central host follows this list.)
  • Systems should have access to a small budget for miscellaneous items, such as extra data center smart hands beyond the two hours we already have in our contract. For example, a US$75 expense for this was only approved on October 26 at 9:30 UTC, 49 hours after it was requested. Delays like this force Systems to work around such restrictions and waste time.
  • We need to be able to consult specialised support occasionally. For instance, at one point we attempted to limit the number of requests by IP address, but the limit ended up being applied to our firewall's IP address on one server and not the other (a sketch of this pitfall follows the list). We would recommend buying support for nginx at US$1,500 per server per year.
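
On the centralised logging point above, the sketch below shows one minimal way to ship application logs to a central syslog host using Python's standard library. The hostname and logger name are placeholders, and this is an illustration of the idea rather than a description of our setup.

    import logging
    import logging.handlers

    # Placeholder address for a central log host; in practice this would be a
    # dedicated logging server (rsyslog, syslog-ng, or similar) with search tools.
    CENTRAL_LOG_HOST = ("logs.example.internal", 514)

    logger = logging.getLogger("archive.app")
    logger.setLevel(logging.INFO)

    # SysLogHandler sends each record over the network, so events from every
    # server end up in one searchable place instead of scattered local files.
    handler = logging.handlers.SysLogHandler(address=CENTRAL_LOG_HOST)
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)

    logger.info("NAS read latency above threshold on %s", "vm-nas-01")  # example event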
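
On the rate-limiting pitfall: the real attempt was made at the web server layer, but the idea translates to a short Python sketch. The threshold, header handling, and addresses below are assumptions for illustration, not our configuration.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120                  # illustrative threshold only
    _history = defaultdict(deque)       # request timestamps per client address

    def client_ip(remote_addr, x_forwarded_for=None):
        # Behind a firewall or load balancer, remote_addr is the firewall's own
        # address, so keying the limit on it throttles every user at once (the
        # mistake described above). The forwarded header set by the trusted
        # front end carries the real client address.
        if x_forwarded_for:
            return x_forwarded_for.split(",")[0].strip()
        return remote_addr

    def allow_request(remote_addr, x_forwarded_for=None):
        key = client_ip(remote_addr, x_forwarded_for)
        now = time.monotonic()
        window = _history[key]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()            # drop requests outside the window
        if len(window) >= MAX_REQUESTS:
            return False                # this client is over the limit
        window.append(now)
        return True

    # Example: a request arriving via the firewall at 10.0.0.1 on behalf of a user.
    print(allow_request("10.0.0.1", x_forwarded_for="203.0.113.7"))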

Technologies to Consider

  • Investigate MaxScale rather than HAProxy for load balancing MySQL.
  • Investigate RabbitMQ as an alternative to resque. This makes sense when multiple servers need to take the same action, e.g. cache invalidation (see the sketch below).
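
To illustrate why RabbitMQ suits that case, here is a minimal sketch of its fan-out pattern using the pika client library. The hostname, exchange name, and cache key are placeholders, not part of any existing Archive code.

    import pika

    # Placeholder broker address; assumes the pika client library is installed.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq.example.internal")
    )
    channel = connection.channel()

    # A fanout exchange copies every message to all bound queues, so each app
    # server (with its own queue) sees every cache-invalidation message.
    channel.exchange_declare(exchange="cache_invalidation", exchange_type="fanout")

    channel.basic_publish(
        exchange="cache_invalidation",
        routing_key="",                 # ignored by fanout exchanges
        body="works/123456",            # hypothetical cache key to expire
    )
    connection.close()

    # Consumer side (run on each app server): bind a private queue to the
    # exchange so every server receives every invalidation message.
    #   result = channel.queue_declare(queue="", exclusive=True)
    #   channel.queue_bind(exchange="cache_invalidation", queue=result.method.queue)

With resque, a job is picked up by a single worker; a fan-out exchange delivers the same message to every subscriber, which is what an action like cache invalidation across multiple servers needs.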