Update, April 4: We successfully deployed an improved version of the code referenced in this post on March 29. It now takes considerably less time to add a work to the database.
You may have noticed the Archive has been slow or giving 502 errors when posting or editing works, particularly on weekends and during other popular posting times. Our development and Systems teams have been working to address this issue, but our March 17 attempt at a fix failed, leading to several hours of downtime and site-wide slowness.
Overview
Whenever a user posts or edits a work, the Archive updates how many times each tag on the work has been used across the site. While a count is being updated, that tag's record is locked and the database cannot process other changes to the tag. This can result in slowness or even 502 errors when multiple people are trying to post using the same tag. Because all works are required to use rating and warning tags, works' tags frequently overlap during busy posting times.
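To illustrate the kind of update involved (this is not the Archive's actual code; the Tag model and taggings_count column here are assumptions), a Rails counter is typically bumped like this:

```ruby
# Illustrative sketch only -- assumes a Tag model with a taggings_count column.
def bump_tag_count(tag_id)
  Tag.increment_counter(:taggings_count, tag_id)
  # Issues roughly: UPDATE tags SET taggings_count = taggings_count + 1 WHERE id = <tag_id>
  # That UPDATE takes a row lock on the tag, so while one work's transaction holds
  # the lock on a popular tag, every other work using the same tag has to wait.
end
```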
Unfortunately, the only workaround currently available is to avoid posting, editing, or adding chapters to works at peak times, particularly Saturdays and Sundays (UTC). We strongly recommend saving your work elsewhere so changes won’t be lost if you receive a 502.
For several weeks, we’ve had temporary measures in place to decrease the number of 502 errors. However, posting is still slow and errors are still occurring, so we’ve been looking for more ways to use hardware and software to speed up the posting process.
Our Friday, March 17, downtime was scheduled so we could deploy a code change we hoped would help. The change would have allowed us to cache tag counts for large tags (e.g. ratings, common genres, and popular fandoms), updating them only periodically rather than every time a work was posted or edited. (We chose to cache only large tags because the difference between 1,456 and 1,464 is less significant than the difference between one and nine.) However, the change led to roughly nine hours of instability and slowness and had to be rolled back.
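As a rough sketch of the idea (the names and threshold here are illustrative, not the code we deployed), reading a large tag's count through a cache might look something like this:

```ruby
# Illustrative sketch of caching a large tag's use count -- not the deployed code.
LARGE_TAG_THRESHOLD = 40_000

def cached_taggings_count(tag)
  # Small tags keep reading the live count; large tags reuse a cached value for a while.
  return tag.taggings_count if tag.taggings_count < LARGE_TAG_THRESHOLD

  Rails.cache.fetch("tags/#{tag.id}/taggings_count", expires_in: 40.minutes) do
    tag.taggings_count
  end
end
```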
Fixing this is our top priority, and we are continuing to look for solutions. Meanwhile, we’re updating our version of the Rails framework, which is responsible for the slow counting process. While we don’t believe this upgrade will be a solution by itself, we are optimistic it will give us a slight performance boost.
March 17 incident report
The code deployed on March 17 allowed us to set a caching period for a tag’s use count based on the size of the tag. While the caching period and tag sizes were adjusted throughout the day, the code used the following settings when it was deployed:
- Small tags with fewer than 1,000 uses would not be cached.
- Medium tags with 1,000-39,999 uses would be cached for 3-40 minutes, depending on the tag’s size.
- Large tags with at least 40,000 uses would be cached for 40-60 minutes, but the cache would be refreshed every 30 minutes. Unlike small and medium tags, the counts for large tags would not update when a work was posted -- they would only update during browsing. Refreshing the cache every 30 minutes would prevent pages from loading slowly.
We chose to deploy at a time of light system load so we would be able to fine tune these settings before the heaviest weekend load. The deploy process itself went smoothly, beginning at 12:00 UTC and ending at 12:14 -- well within the 30 minutes we allotted for downtime.
By 12:40, we were under heavy load and had to restart one of our databases. We also updated the settings for the new code so tags with 250 or more uses would fall into the “medium” range and be cached. We increased the minimum caching period for medium tags from three minutes to ten.
At 12:50, we could see we had too many writes going to the database. To stabilize the site, we made it so only two of our seven app servers were writing cache counts to the database.
However, at 13:15, the number of writes overwhelmed MySQL: it was writing constantly, which made the service unavailable and eventually caused it to crash. We put the Archive into maintenance mode and began a full MySQL cluster restart. Because the writes had exceeded the databases' capabilities, the databases had fallen out of sync with each other. Resynchronizing the first two servers using the built-in method took about 65 minutes, starting at 13:25 and completing at 14:30. Using a different method to bring the third, recalcitrant server into line allowed us to return the system to use sooner.
By 14:57, we had a working set of two out of three MySQL servers in a cluster and were able to bring the Archive back online. Before bringing the site back, we also updated the code for the tag autocomplete, replacing a call that could write to the database with a simple read instead.
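As a hypothetical example of that kind of change (these are not the Archive's actual autocomplete calls), the general shape is swapping a lookup that creates missing records for one that only reads:

```ruby
# Before: creates a tag row if the name isn't found, i.e. a potential write.
tag = Tag.find_or_create_by(name: params[:term])

# After: a pure read; an unknown name simply returns no suggestion.
tag = Tag.find_by(name: params[:term])
```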
At 17:48, we were able to bring the last MySQL server back and rebalance the load across all three servers. However, the database dealing with writes was sitting at 91% load rather than the more normal 4-6%.
At 18:07, we made it so only one app server wrote tags’ cache values to the database. This dropped the load on the write database to about 50%.
At 19:40, we began implementing a hotfix that significantly reduced writes to the database server, but having all seven app servers writing to the database once more pushed the load up to about 89%.
At 20:30, approximately half an hour after the hotfix was finished, we removed the writes from three of the seven machines. While this reduced the load, the reduction was not significant enough to resolve the issues the Archive was experiencing. Nevertheless, we let the system run for 30 minutes so we could monitor its performance.
Finally, at 21:07, we decided to take the Archive offline and revert the release. The Archive was back up and running the old code by 21:25.
We believe the issues with this caching change were caused by underestimating the number of small tags on the Archive and overestimating the accuracy of their existing counts. With the new code in place, the Archive began correcting the inaccurate counts for small tags, leading to many more writes than we anticipated. If we're able to get these writes under control, we believe this code might still be a viable solution. Unfortunately, this is made difficult by the fact that we can't simulate production-level load in our testing environment.
Going forward
We are currently considering five possible ways to improve posting speed going forward, although other options might present themselves as we continue to study the situation.
- Continue with the caching approach from our March 17 deploy. Although we chose to revert the code due to the downtime it had already caused, we believe we were close to resolving the issue with database writes. We discovered that the writes overwhelming our database were largely secondary writes caused by our tag sweeper. These secondary writes could likely be reduced by putting checks in the sweeper to prevent unnecessary updates to tag counts (see the sketches after this list).
- Use the rollout gem to alternate between the current code and the code from our March 17 deploy. This would allow us to deploy and troubleshoot the new caching code with minimal interruption to normal Archive function: we could study the load caused by the new code and switch back to the old code before problems arose. However, it would also make the new code much more complex, meaning it would not only be more error-prone, but would also take a while to write, and users would have to put up with the 502 errors for longer.
- Monkey patch the Rails code that updates tag counts. We could modify the default Rails code so it would still update the count for small tags, but not even try to update the count on large tags. We could then add a task that would periodically update the count on larger tags.
- Break work posting into smaller transactions. The current slowness comes from large transactions that are live for too long. Breaking the posting process into smaller parts would resolve that, but we would then run the risk of creating inconsistencies in the database. In other words, if something went wrong while a user was updating their work, only some of their changes might be saved.
- Completely redesign work posting. We currently have about 19,000 drafts and 95,000 works created in a month, and moving drafts to a separate table would allow us to update the tag counts only when a work was finally posted. We could then make posting from a draft the only option. Pressing the "Post" button on a draft would set a flag on the entry in the draft table and add a Resque job to post the work, allowing us to serialize updates to tag counts. Because the user would only be making a minor change in the database, the web page would return instantly. However, there would be a wait before the work was actually posted.
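To make some of these options more concrete, here are rough sketches. None of this is the Archive's actual code; model names, column names, and thresholds are illustrative assumptions. For the first option, the check in the tag sweeper could be as simple as skipping writes that wouldn't change the stored count:

```ruby
# Hypothetical guard -- assumes a Tag model with a taggings_count column
# and a `taggings` association.
def refresh_tag_count(tag)
  fresh_count = tag.taggings.count              # recount the tag's uses
  return if tag.taggings_count == fresh_count   # skip the write if nothing changed

  tag.update_column(:taggings_count, fresh_count)
end
```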
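For the second option, the rollout gem keeps feature flags in Redis, so the new caching path could be switched on for a fraction of traffic and switched off again without a deploy. A minimal sketch, with a made-up feature name:

```ruby
require "redis"
require "rollout"

ROLLOUT = Rollout.new(Redis.new)

# Turn the new caching path on for 20% of users -- or back off -- at runtime.
ROLLOUT.activate_percentage(:cached_tag_counts, 20)
# ROLLOUT.deactivate(:cached_tag_counts)

def taggings_count_for(tag, user)
  if ROLLOUT.active?(:cached_tag_counts, user)
    # New path: reuse a cached value for a while (as in the caching sketch above).
    Rails.cache.fetch("tags/#{tag.id}/taggings_count", expires_in: 40.minutes) { tag.taggings_count }
  else
    tag.taggings_count  # current behaviour
  end
end
```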
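For the third option, a monkey patch could intercept Rails' counter update and bail out when the tag is already large, leaving a periodic task to refresh those counts. Again, the names and threshold are assumptions:

```ruby
# Hypothetical patch: skip live counter updates for very large tags.
module SkipLargeTagCounterUpdates
  LARGE_TAG_THRESHOLD = 40_000

  def update_counters(id, counters)
    if counters.key?(:taggings_count) || counters.key?("taggings_count")
      current = where(id: id).pluck(:taggings_count).first
      return 0 if current && current >= LARGE_TAG_THRESHOLD  # leave large tags alone
    end
    super
  end
end

Tag.singleton_class.prepend(SkipLargeTagCounterUpdates)

# A periodic task (cron, or a scheduled Resque job) would then refresh the large tags:
Tag.where("taggings_count >= ?", SkipLargeTagCounterUpdates::LARGE_TAG_THRESHOLD).find_each do |tag|
  Tag.reset_counters(tag.id, :taggings)   # assumes a `taggings` association
end
```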
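For the last option, pressing "Post" would only flip a flag and enqueue a Resque job, and the job itself would do the heavy lifting -- including the tag count updates -- one work at a time. A rough sketch with made-up model and queue names:

```ruby
# Hypothetical Resque job that turns a draft into a posted work.
class PostWorkFromDraft
  @queue = :work_posting

  def self.perform(draft_id)
    draft = Draft.find(draft_id)
    work  = draft.build_work           # copy the draft's content and tags over
    work.save!                         # tag counts are updated here, one job at a time
    draft.update_column(:posted, true)
  end
end

# In the controller, posting only sets a flag and enqueues the job,
# so the page can return immediately:
draft.update_column(:post_requested, true)
Resque.enqueue(PostWorkFromDraft, draft.id)
```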
The unexpected downtime that occurred around noon UTC on Tuesday, March 21, was caused by an unusually high number of requests to Elasticsearch and is unrelated to the issues discussed in this post. A temporary fix is currently in place and we are looking for long-term solutions.