AO3 News

Post Header

Since the end of December 2024, AO3 has had numerous periods of slowness, downtime, and related issues such as missing kudos emails and delayed invitations. We've been taking some steps to improve the situation, but we are also working on some highly time-sensitive updates to our infrastructure, so we can't spend as much time as we'd like on performance improvements. We expect some slowness and downtime to continue until our new servers are delivered and installed in a few months.

We first noticed some strain on the servers we use for Elasticsearch (which powers searching and filtering) in the middle of last year. The new servers we wanted weren't available yet, so we repurposed some of our other servers to help with the load on Elasticsearch until we could get the hardware.

Unfortunately, the hardware wasn't available on its October release date, and our temporary fix couldn't hold up to the traffic increase we experience at the end of every year. This has led to periods of noticeable slowness over the last several weeks.

The servers we wanted finally became available in early January, and we completed the process of getting quotes and requisitioning them by January 15. Our purchase was confirmed on January 28, but it will take a few months for the servers to be delivered and installed.

We estimate the new Elasticsearch servers will be in place by early April. Until then, you might run into the following issues, especially during busy periods:

  • all pages loading more slowly
  • Elasticsearch-powered pages like search results and work and bookmark listings taking longer to update
  • error pages
  • automated checks from Cloudflare's Under Attack mode
  • stricter rate limiting
  • issues with services like the Wayback Machine or Tumblr RSS accounts that rely on bots, scrapers, or other automated tools, which we have deprioritized in favor of traffic from users

In addition to new Elasticsearch servers, we'll be purchasing five database servers to improve the capacity and resilience of our database cluster. We don't currently have enough database power to handle increased traffic and do certain types of maintenance at the same time. This means we sometimes have to take AO3 offline to resolve database issues, as we did for our February 7 maintenance. Additional hardware should help us avoid this situation in the future, but it will take some time for the purchase to be completed and the servers to be installed. We do not anticipate any database issues while we wait and there is no risk of data loss.

We're very sorry for the disruptions, and we appreciate your patience and your generous donations, which fund purchases like these.

For updates on slowness, downtime, or other issues, please follow @AO3_Status on Twitter/X or ao3org on Tumblr. We're also in the process of setting up a status account on Bluesky and a status page, but they're still works in progress and might not receive all updates just yet, so please make sure to check Twitter/X or Tumblr for a fully accurate list of updates.

Post Header

The Archive has seen a marked uptick in traffic during March, with weekly page views increasing from 262 million to 298 million in just two weeks. We expect this trend to continue, and in order to keep the site running, we need to take emergency measures. The quickest, most helpful change we can make is caching the works we serve to logged out users. Unfortunately, this means that starting immediately, logged out users may experience a delay in work updates, and hits from logged out users will no longer be included in works' hit counts.

Why is this happening?

The increased traffic is putting a strain on our database servers, which receive dozens of requests every time someone loads a work. (We plan to order new hardware to help with this strain, but first we need to finish some ongoing server maintenance and determine which hardware to order. Delivery and installation of servers ordinarily takes a few months, and there may be unexpected delays due to the pandemic.)

Serving cached copies of works to logged out users will drastically reduce the number of database requests we make. Caching means we don't have to ask the database for the latest information every time someone visits a page. Instead, one of our front end servers gives every logged out visitor the exact same copy of that page, and after about an hour, that copy is updated.
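
As a rough sketch of the general idea (not our actual code; the controller shown, the logged_in? helper, and the exact one-hour window are placeholders), a Rails controller can mark work pages served to logged out visitors as publicly cacheable, so a front end cache such as nginx or Varnish can hand everyone the same stored copy until it expires:

    # Hypothetical sketch: share one cached copy of a work page among all
    # logged out visitors for about an hour.
    class WorksController < ApplicationController
      def show
        @work = Work.find(params[:id])

        # Logged in users keep getting fresh, personalized pages (Rails'
        # default headers already prevent shared caching). Everyone else gets
        # a response the front end cache may store and reuse for an hour.
        expires_in 1.hour, public: true unless logged_in?
      end
    end

That shared, hour-long copy is what produces the update delays described below.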

What changes will I notice?

Starting immediately, you may notice the following changes:

  1. When a new chapter is posted, logged out users will only be able to access it by direct link until the cache updates, which will happen about once every 60 minutes. Other changes to the work (e.g., edits made by the creator or new comments or kudos that have been left) may also not be visible to logged out users until the cache is updated.
  2. Because work pages need to be identical for all logged out users, we've had to stop automatically filling in guests' names and emails on the comment form. (You can request an invitation and create an account if you'd like the form to remember you!)
  3. Logged out users will see the adult content warning on every work rated Mature, Explicit, or Not Rated. This is temporary and will be fixed as soon as possible.
  4. New hits from logged out users will not be added to works' hit counts. (Existing hits will not be lost.) The code that increases hit counts lives on our application servers, so it will not run when the front end servers hand out cached copies of works.

(This section was updated at 00:25 UTC April 1.)

Will hit counts be fixed?

We are exploring options that will allow us to resume counting hits from logged out users, but it may take some time to find and implement a viable long term solution. We'll work as quickly as we can, but we ask for your patience -- our volunteers may need to prioritize additional performance improvements or their own wellbeing in these stressful times.
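
One possible shape for such a solution (a sketch only, and not necessarily what we will end up deploying; HitsController, HitCountRollupJob, the Redis key format, and the hit_count column are all made up for the example): keep the page itself cached, but have a tiny uncached endpoint record each hit in a fast store like Redis, and fold those counters back into the database periodically.

    require "redis"

    # Hypothetical sketch: a small endpoint, excluded from caching, bumps a
    # per-work counter in Redis instead of writing to the database directly.
    class HitsController < ApplicationController
      REDIS = Redis.new

      def create
        REDIS.incr("work:#{params[:work_id]}:hits")
        head :no_content
      end
    end

    # A periodic job then folds the Redis counters into the hit counts
    # stored in the database.
    class HitCountRollupJob
      @queue = :hit_counts

      def self.perform
        redis = Redis.new
        redis.scan_each(match: "work:*:hits") do |key|
          work_id = key.split(":")[1]
          count   = redis.getset(key, 0).to_i
          Work.update_counters(work_id, hit_count: count) if count > 0
        end
      end
    end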

We will keep you updated on any significant progress or setbacks here on AO3 News and on our @AO3_Status Twitter feed.

Updated 11:00 UTC April 24: We have deployed new code that allows us to resume counting hits from logged out users, along with some general changes to how hits are measured.

Post Header

Published: 2019-11-15 23:54:27 UTC

Over the last few weeks, you may have noticed a few brief periods when the Archive has been slow to load (or refusing to load at all). This is because our Elasticsearch servers are under some strain. We've made some adjustments to our server setup, but because this is our busiest time of year, we expect the problems to continue until we're able to have new servers delivered and installed in a few months.

We've been planning this server purchase for a while now, but the machines we wanted -- servers built around AMD's Epyc Rome CPUs, which offer more cores at a lower price than the equivalent Intel processors -- didn't come on the market until August. Now that they've been released, we're working on finding the best price to help us make the most of your generous donations. We expect to order them very soon.

While we're waiting for our new servers, we plan to upgrade the Elasticsearch software to see if the newer version offers any performance improvements. We hope this upgrade and the changes to our server setup will keep things from getting much worse during our end-of-year traffic influx.

Thank you for your patience, and for all the donations that allow us to buy new hardware when these situations arise!

Update 27 November: The servers have been ordered, but it will still be a few months before they are delivered and installed.

Post Header

Published: 2017-03-21 19:17:26 UTC

Update, April 4: We successfully deployed an improved version of the code referenced in this post on March 29. It now takes considerably less time to add a work to the database.

-

You may have noticed the Archive has been slow or giving 502 errors when posting or editing works, particularly on weekends and during other popular posting times. Our development and Systems teams have been working to address this issue, but our March 17 attempt failed, leading to several hours of downtime and site-wide slowness.

Overview

Whenever a user posts or edits a work, the Archive updates how many times each tag on the work has been used across the site. While those counts are being written, each tag's record is locked and the database cannot process other changes to that tag. This can result in slowness or even 502 errors when multiple people are trying to post using the same tag. Because all works are required to use rating and warning tags, works' tags frequently overlap during busy posting times.
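
To make the contention concrete, here is a rough sketch of the pattern (the taggings_count column and the tags association are assumed names, not the Archive's actual code): every tag's count is bumped inside the single transaction that saves the work, and each of those updates holds a row lock until the whole transaction commits.

    # Hypothetical sketch: tag counts updated inside the posting transaction.
    def post_work(work)
      ActiveRecord::Base.transaction do
        work.save!

        work.tags.each do |tag|
          # Runs: UPDATE tags SET taggings_count = taggings_count + 1 WHERE id = ?
          # Anyone else posting a work with this tag (say, a rating tag shared
          # by nearly every work) must wait here until the transaction commits.
          Tag.increment_counter(:taggings_count, tag.id)
        end
      end
    end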

Unfortunately, the only workaround currently available is to avoid posting, editing, or adding chapters to works at peak times, particularly Saturdays and Sundays (UTC). We strongly recommend saving your work elsewhere so changes won’t be lost if you receive a 502.

For several weeks, we’ve had temporary measures in place to decrease the number of 502 errors. However, posting is still slow and errors are still occurring, so we’ve been looking for more ways to use hardware and software to speed up the posting process.

Our Friday, March 17, downtime was scheduled so we could deploy a code change we hoped would help. The change would have allowed us to cache tag counts for large tags (e.g. ratings, common genres, and popular fandoms), updating them only periodically rather than every time a work was posted or edited. (We chose to cache only large tags because the difference between 1,456 and 1,464 is less significant than the difference between one and nine.) However, the change led to roughly nine hours of instability and slowness and had to be rolled back.

Fixing this is our top priority, and we are continuing to look for solutions. Meanwhile, we’re updating our version of the Rails framework, which is responsible for the slow counting process. While we don’t believe this upgrade will be a solution by itself, we are optimistic it will give us a slight performance boost.

March 17 incident report

The code deployed on March 17 allowed us to set a caching period for a tag’s use count based on the size of the tag. While the caching period and tag sizes were adjusted throughout the day, the code used the following settings when it was deployed:

  • Small tags with fewer than 1,000 uses would not be cached.
  • Medium tags with 1,000-39,999 uses would be cached for 3-40 minutes, depending on the tag’s size.
  • Large tags with at least 40,000 uses would be cached for 40-60 minutes, but the cache would be refreshed every 30 minutes. Unlike small and medium tags, the counts for large tags would not update when a work was posted -- they would only update during browsing. Refreshing the cache every 30 minutes would prevent pages from loading slowly.
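
As an illustration of that tiering (a sketch only -- taggings_count, the taggings association, and both method names are placeholders, not the code that was deployed): small tags keep an exact count that is updated on every post, while bigger tags skip the per-post write and serve a cached count that is refreshed on a schedule tied to their size.

    # Hypothetical sketch: cache a tag's use count for longer the bigger it is.
    class Tag < ActiveRecord::Base
      SMALL = 1_000

      # Called when a work is posted or edited.
      def count_use!
        # Small tags keep an exact count; bigger tags skip the write and rely
        # on the periodically refreshed cache below.
        Tag.increment_counter(:taggings_count, id) if taggings_count < SMALL
      end

      # Called when the count is displayed while browsing.
      def display_count
        return taggings_count if taggings_count < SMALL

        # Roughly one minute of caching per thousand uses, clamped to the
        # 3-60 minute range used in the deploy; expired entries recount.
        expiry = [[taggings_count / 1_000, 3].max, 60].min.minutes
        Rails.cache.fetch("tag/#{id}/count", expires_in: expiry) do
          taggings.count
        end
      end
    end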

We chose to deploy at a time of light system load so we would be able to fine tune these settings before the heaviest weekend load. The deploy process itself went smoothly, beginning at 12:00 UTC and ending at 12:14 -- well within the 30 minutes we allotted for downtime.

By 12:40, we were under heavy load and had to restart one of our databases. We also updated the settings for the new code so tags with 250 or more uses would fall into the “medium” range and be cached. We increased the minimum caching period for medium tags from three minutes to 10.

At 12:50, we could see we had too many writes going to the database. To stabilize the site, we made it so only two out of seven servers were writing cache counts to the database.

However, at 13:15, the number of writes overwhelmed MySQL. It was constantly writing, making the service unavailable and eventually crashing. We put the Archive into maintenance mode and began a full MySQL cluster restart. Because the writes had exceeded the databases' capabilities, the databases had become out of sync with each other. Resynchronizing the first two servers by the built-in method took about 65 minutes, starting at 13:25 and completing at 14:30. Using a different method to bring the third recalcitrant server into line allowed us to return the system to use sooner.

By 14:57, we had a working set of two out of three MySQL servers in a cluster and were able to bring the Archive back online. Before bringing the site back, we also updated the code for the tag autocomplete, replacing a call that could write to the database with a simple read instead.

At 17:48, we were able to bring the last MySQL server back and rebalance the load across all three servers. However, the database dealing with writes was sitting at 91% load rather than the more normal 4-6%.

At 18:07, we made it so only one app server wrote tags’ cache values to the database. This dropped the load on the write database to about 50%.

At 19:40, we began implementing a hotfix that significantly reduced writes to the database server, but having all seven systems writing to the database once more put the load up to about 89%.

At 20:30, approximately half an hour after the hotfix was finished, we removed the writes from three of the seven machines. While this reduced the load, the reduction was not significant enough to resolve the issues the Archive was experiencing. Nevertheless, we let the system run for 30 minutes so we could monitor its performance.

Finally, at 21:07, we decided to take the Archive offline and revert the release. The Archive was back up and running the old code by 21:25.

We believe the issues with this caching change were caused by underestimating the number of small tags on the Archive and overestimating the accuracy of their existing counts. With the new code in place, the Archive began correcting the inaccurate counts for small tags, leading to many more writes than we anticipated. If we're able to get these writes under control, we believe this code might still be a viable solution. Unfortunately, this is made difficult by the fact we can’t simulate production-level load on our testing environment.

Going forward

We are currently considering five possible ways to improve posting speed going forward, although other options might present themselves as we continue to study the situation.

  1. Continue with the caching approach from our March 17 deploy. Although we chose to revert the code due to the downtime it had already caused, we believe we were close to resolving the issue with database writes. We discovered that the writes overwhelming our database were largely secondary writes caused by our tag sweeper. These secondary writes could likely be reduced by putting checks in the sweeper to prevent unnecessary updates to tag counts.
  2. Use the rollout gem to alternate between the current code and the code from our March 17 deploy. This would allow us to deploy and troubleshoot the new caching code with minimal interruption to normal Archive function. We would be able to study the load caused by the new code while being able to switch back to the old code before problems arose. However, it would also make the new code much more complex. This means the code would not only be more error-prone, but would also take a while to write, and users would have to put up with the 502 errors longer.
  3. Monkey patch the Rails code that updates tag counts. We could modify the default Rails code so it would still update the count for small tags, but not even try to update the count on large tags. We could then add a task that would periodically update the count on larger tags.
  4. Break work posting into smaller transactions. The current slowness comes from large transactions that are live for too long. Breaking the posting process into smaller parts would resolve that, but we would then run the risk of creating inconsistencies in the database. In other words, if something went wrong while a user was updating their work, only some of their changes might be saved.
  5. Completely redesign work posting. We currently have about 19,000 drafts and 95,000 works created in a month, and moving drafts to a separate table would allow us to only update the tag counts when a work was finally posted. We could then make posting from a draft the only option. Pressing the "Post" button on a draft would set a flag on the entry in the draft table and add a Resque job to post the work, allowing us to serialize updates to tag counts. Because the user would only be making a minor change in the database, the web page would return instantly. However, there would be a wait before the work was actually posted. (See the sketch after this list.)
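
To illustrate option 5 (a sketch only -- PostDraftJob, Draft, create_from_draft!, the posting flag, and the controller action are placeholders, not a design that has been built): pressing "Post" would just flag the draft and enqueue a job, and a single worker processing that queue would then create works and update tag counts one at a time.

    # Hypothetical sketch of option 5: post works from drafts via a queued
    # job so tag count updates happen one at a time instead of competing
    # inside web requests.
    class PostDraftJob
      @queue = :work_posting  # processed by a single worker, in order

      def self.perform(draft_id)
        draft = Draft.find(draft_id)
        work  = Work.create_from_draft!(draft)  # assumed helper
        work.tags.each { |tag| Tag.increment_counter(:taggings_count, tag.id) }
        draft.destroy
      end
    end

    # The "Post" button only flags the draft and enqueues the job, so the
    # page returns immediately while posting happens in the background.
    class DraftsController < ApplicationController
      def post_work
        draft = Draft.find(params[:id])
        draft.update!(posting: true)
        Resque.enqueue(PostDraftJob, draft.id)
        redirect_to drafts_path, notice: "Your work will be posted shortly."
      end
    end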

Note: The unexpected downtime that occurred around noon UTC on Tuesday, March 21, was caused by an unusually high number of requests to Elasticsearch and is unrelated to the issues discussed in this post. A temporary fix is currently in place and we are looking for long term solutions.

Post Header

Published: 2015-03-18 18:37:57 UTC

Banner by Diane with the outlines of a man and woman speaking with word bubbles, one of which has the OTW logo and the other which says 'OTW Announcement'

AO3 and Fanlore users take note: both sites will be down for scheduled maintenance this Friday, 20th March. The maintenance will take place from 17:30 to approximately 20:45 UTC.

Note that maintenance may also take slightly longer than expected, so there is no need to contact us if you cannot access the site at exactly 20:45 UTC.

For site status updates about AO3 and Fanlore please follow @AO3_Status and @fanlore_news on Twitter.

Post Header

Published: 2014-01-23 21:26:51 UTC

If you're a regular Archive visitor or if you follow our AO3_Status Twitter account, you may have noticed that we've experienced a number of short downtime incidents over the last few weeks. Here's a brief explanation of what's happening and what we're doing to fix things.

The issue

Every now and then, the volume of traffic we get and the amount of data we're hosting starts to hit the ceiling of what our existing infrastructure can support. We try to plan ahead and start making improvements in advance, but sometimes things simply catch up to us a little too quickly, which is what's happening now.

The good news is that we do have fixes in the works: we've ordered some new servers, and we hope to have them up and running soon. We're making plans to upgrade our database system to a cluster setup that will handle failures better and support more traffic; however, this will take a little longer. And we're working on a number of significant code fixes to relieve bottlenecks and reduce server load - we hope to have the first of those out within the next two weeks.

One affected area is the counting of hits, kudos, comments, and bookmarks on works, so you may see delays in those numbers updating, which will also result in slightly inaccurate search and sort results. Issues with the "Date Updated" sorting on bookmark pages will persist until a larger code rewrite has been deployed.

Behind the scenes

We apologize to everyone who's been affected by these sudden outages, and we'll do our best to minimize the disruption as we work on making things better! We have an all-volunteer staff, and while we try to respond to server problems quickly, they sometimes happen when we're all either at work or asleep, so we can't always fix things as soon as we'd like to.

While we appreciate how patient and supportive most Archive users are, please keep in mind that tweets and support requests go to real people who may find threats of violence or repeated expletives aimed at them upsetting. Definitely let us know about problems, but try to keep it to language you wouldn't mind seeing in your own inbox, and please understand if we can't predict immediately how long a sudden downtime might take.

The future

Ultimately, we need to keep growing and making things work better because more and more people are using AO3 each year, and that's something to be excited about. December and January tend to bring a lot of activity to the site - holiday gift exchanges are posted or revealed, people are on vacation, and a number of fandoms have new source material.

We're looking forward to seeing all the new fanworks that people create this year, and we'll do our best to keep up with you! And if you're able to donate or volunteer your time, that's a huge help, and we're always thrilled to hear from you.

Post Header

Published: 2013-11-20 19:45:01 UTC

Update December 14, 18:00 UTC: As of this week, all systems should be back to normal. We're still working on optimizing our server settings, so very brief downtimes for maintenance should be expected. If bookmarks still won't sort correctly for you - we're working on a more permanent fix to the underlying issue, but it might be a short while yet. As always, we're keeping an eye on Support tickets and messages to our Twitter account, and will react as quickly as possible if anything seems off. Thank you all for your patience.

Update December 3, 16:00 UTC: We have re-enabled the sort and filter sidebar on work listings only. Bookmark filtering and sorting is still turned off and will likely be off for a few more days. (The filters are the sidebar that allows you to narrow down a list of works or bookmarks by character, rating, etc.) We will continue to work on the underlying issue. In the meantime, we suggest using the Works Search to help find what you’re looking for.

All works and bookmarks should be showing up normally. Work re-indexing is complete, so we hope to be able to turn on filtering for works again in the next day or two.

Bookmark re-indexing is still ongoing, so it will be several days before we can turn bookmark filtering back on.

Please follow the @AO3_Status Twitter feed or check back here for further updates.

Update 2 Dec: Listings for works, bookmarks, tags, and pseuds are unavailable due to issues with our search index. Our coding and systems volunteers are currently looking into it, and we will keep you updated on our progress. Our Support team is working on a backlog, so there might be delays in getting back to users individually. Please consider checking the @AO3_Status Twitter feed or our banner alerts instead.

Update 30 Nov: All bookmarks have been re-indexed and should show up correctly again. Any issues that might still be lingering will be sorted out when we upgrade Elasticsearch, which we're planning for mid-December. Downloads should be working without the need for any workarounds now. Thank you for your patience!

The Good

We recently deployed new code, which fixed a couple of very old bugs and introduced improvements to the kudos feature. Behind the scenes, we've been working on setting up new servers and tweaking server settings to make everything run a little more smoothly during peak times. The end of the year (holiday season in many parts of the world) usually means more people with more free time to participate in more challenges, read more fic, or post more fanart, resulting in more site usage.

One way to measure site usage is looking at page views. This number tells us how many pages (a single work, a list of search results, a set of bookmarks in a collection, a user profile, etc. etc.) were served to users during a certain time frame. Some of these pages can contain a lot of information that has to be retrieved from the database - and a lot of information being retrieved from the database at the same time can result in site slowness and server woes. During the first week of January we had 27.6 million page views. As of November 17 we registered 42.9 million page views for the preceding week.

We've watched our traffic stats grow dramatically over the years, and we've been doing our best to keep up with our users! Buying and installing more servers is one part of the solution, and we can't thank our all-volunteer Systems team enough for all their hard work behind the scenes. On the other hand, our code needs to be constantly reviewed and updated to match new demands.

Writing code that "scales" - that works well even as the site grows - is a complicated and neverending task that requires a thorough understanding of how all parts of the Archive work together, not just right now, but in six months, or a year, or two years. As we're all volunteers who work on the Archive in our free time (or during lunch breaks), and there are only a handful of us with the experience to really dig deep into the code, this is less straightforward than a server acquisition and will take a little more time.

The Bad

As such, we've been battling some site slowness, sudden downtimes (thankfully brief due to our awesome Systems team) and an uptick in error pages. We can only ask for your patience as we investigate likely causes and discuss possible fixes.

For the time being, we have asked our intrepid tag wranglers to refrain from wrangling on Sundays, as this is our busiest day and moving a lot of tags around sadly adds to the strain on the current servers. We sincerely apologize to all wrangling volunteers who have to catch up with new tags on Monday, and to users who might notice delays (e.g. a new fandom tag that's not marked as canonical right away). From what we've seen so far, this move has helped in keeping the site stable on weekends.

The Ugly

We are aware of an issue with seemingly "vanishing" bookmarks, in which the correct number of bookmarks is displayed in the sidebar, but not all of them are actually shown. The most likely culprit is our search index, powered by a search server called Elasticsearch. All our information (work content, tags, bookmarks, users, kudos, etc. etc.) is stored in a database, and Elasticsearch provides quicker, neater access to some of this data. This allows for fast searches, and lets us build lists of works and bookmarks (e.g. by tag) without having to ask the database to give us every single scrap of info each time.

It appears now that Elasticsearch has become slightly out of sync with the database. We are looking into possible fixes and are planning an Elasticsearch software upgrade; however, we must carefully test it first to ensure data safety.
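
When the index drifts out of sync like this, the usual remedy is to rebuild it from the database. Very roughly (a sketch using the elasticsearch-ruby client, with made-up index and field names rather than our actual reindexing task):

    require "elasticsearch"

    # Hypothetical sketch: rebuild the bookmarks index from the database so
    # the search index matches what the database actually contains.
    client = Elasticsearch::Client.new(url: ENV["ELASTICSEARCH_URL"])

    Bookmark.find_each do |bookmark|
      client.index(
        index: "bookmarks",
        id:    bookmark.id,
        body:  {
          bookmarkable_id: bookmark.bookmarkable_id,
          user_id:         bookmark.user_id,
          created_at:      bookmark.created_at
        }
      )
    end
    # (A real reindex would use the bulk API and run in batches.)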

This problem also affects bookmark sorting, which has been broken for several weeks now. We are very sorry! If you want to know if a particular work has been updated, please consider subscribing to the work (look for the "Subscribe" button at the top of the page). This will send you a notification when a new chapter has been posted.

(Note: Since we're sending out a lot of notifications about kudos, comments and subscriptions every day, some email providers are shoving our messages into the junk folder, or rejecting them outright. Please add our email address [email protected] to your contacts, create a filter to never send our emails to spam, or check the new "Social" tab in Gmail if you're waiting for notifications.)

A problem with file downloads only cropped up fairly recently. We don't think this is related to the most recent deploy, and will investigate possible causes. In the meantime, if a .pdf or .mobi file gives you an error 500, try downloading the HTML version first, then give it another shot. This should help until we've fixed the underlying problem.

What You Can Do

If you have not already done so, consider subscribing to our twitter feed @AO3_Status or following us on Tumblr. You can also visit the AO3 News page for updates in the coming weeks or subscribe to the feed.

We thank everyone who has written in about their experiences, and will keep you all updated on our progress. Thank you for your patience as we work on this!

Post Header

Published: 2012-12-18 11:30:12 UTC

It's the season of giving! So, we're pleased to announce that invitation requests are back on the AO3!

Once upon a time (i.e. six months ago), users with Archive accounts could request a few invitations to give out, allowing them to share the Archive with friends and help form communities of like-minded fans.

Unfortunately, earlier this year, as many of you may remember, the Archive was having serious performance issues (we saw the sad 502 page far too often). While our coders and systems team hurried to implement emergency fixes, it was decided that we needed stricter control of the number of accounts being created to reduce the likelihood of unexpected overload. (Generally, people browsing the site without being logged in put a certain amount of stress on the servers, but it's the account perks like bookmarking, subscribing, and accessing a full reading history that contribute to server load to a larger degree.) Back then only 100 invitations were issued to people in the queue each day, so additional user requests could have a serious impact! So, in June, the difficult decision was made to stop giving invitations to existing users. You can read more about what was going on then in our post, Update on AO3 performance issues.

Over the next five months our software upgrades and code improvements caught up with the demand. The queue rate was increased several times, most recently to 750 invitations per day. Given that, we've wanted to go back to giving out invitations to existing users, but there were a few issues to be resolved before we could start.

First, the request form had to be altered to set a maximum number of invitations that a user can request at once. Second, the 1,200 user requests that were in the list when it was shut down had to be addressed. Since we had no limits on how many invitations could be requested back then, we had quite a few requests for very large numbers. Due to limitations in the software, individually lowering those numbers now would require manually editing each request, as would granting only some of the requests at once rather than the whole list.

So, two decisions were made:

1) Everyone with a pending request will receive 1 invitation, just to clear out the backlog.

2) User requests are being re-opened! You can now request a maximum of 10 invitations at one time. Even with this hard limit in place, we ask that you request only what you need at a time. Once we've hit the figurative switch and re-enabled this feature later today, you will be able to request invitations from your Invite a friend page.

We very much appreciate all of our users, and we are proud of our growth this year, even through the bumpy times. We are glad that once again we can enable you to bring more people on board!
