One of the things I love about working at Civis is the opportunity we have for continuous learning. Not long ago I had the opportunity to be involved in a book club which read through Google’s Site Reliability Engineering book. One of the essays in this book addressed various methods for handling overload. From my own site reliability rotations, I knew that overload was a situation which we semi-frequently encountered. Ideally, we would simply avoid overload. Unfortunately, it’s unrealistic to assume that our application, or any application actually, can do so entirely. Because of this, we need to consider ways that we can make our web app handle overload more gracefully, in addition to looking for ways to prevent overload.
Google’s SRE book discusses a variety of ways to mitigate the effects of overload. Not all of their methods make sense in the context of our current system (we’re nowhere near as large and complex as Google) and some of them we already had implemented. Request shedding caught my attention though.
Request shedding, discussed in the SRE book's section on criticality, is the idea that requests to a backend should be tagged with a criticality value. Then, when the backend becomes overloaded, it can ignore less critical requests and focus on those which will affect your users the most. The goal here is that if your backend can’t do all of the work assigned to it, it should focus on only the most important work, providing the best service possible to your users while you and your team debug the underlying issue.
For Civis, overload in our web application was usually triggered by CPU overload on our MySQL database. If MySQL CPU was pegged at 100%, response times on our web app could slow to a crawl. This problem was made worse by the index pages on our web app, which make frequent API calls to request all objects of the indexed type. This polling allows the index pages to stay up to date automatically, without requiring users to refresh the page. However, it also contributes to MySQL load by triggering several API requests a minute whenever a user is sitting on an index page. When MySQL was overloaded, these calls would often start to stack up. The SRE on call might have to ask internal users to stop sitting on index pages, while we debugged the root cause. These polling requests were non-critical, but our backend treated them just the same as other requests.
Luckily, I had the chance to address this issue in a hackweek. Hackweeks are another great part of Civis’s culture. Once a quarter, each engineer is given a full week to dedicate to a project or experiment of their choice, something which isn’t part of their usual project work. This time around, I decided to implement request shedding. My goal was to help our MySQL database recover from overload quickly by dropping low-priority requests at the application layer when approaching overload. This would increase our site’s reliability by improving responsiveness during MySQL overload and, hopefully, by reducing occurrences of overload in the first place.
The first piece of this work was to detect when our MySQL database was approaching overload. Since our MySQL database is deployed on an RDS instance, metrics on it are available through Amazon CloudWatch. I set up a cron job to poll CloudWatch and cache the current CPU utilization in our Ruby on Rails web app. Next, I added logic to our base API controller to check this cache and immediately return sheddable requests with an error if CPU utilization crossed a configurable threshold.
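To make the idea concrete, here is a minimal sketch of the cache-and-check logic in plain Ruby. It is not the code we run in production: the class names, the X-Request-Criticality header, the 90% threshold, and the rule that a stale reading never triggers shedding are all assumptions made for illustration. In a Rails app, the check would sit in a before_action on the base API controller and the reading would live in the app's cache store.

```ruby
# A sketch under stated assumptions; every name here is hypothetical.

# Holds the most recent CPU reading written by the periodic job.
class CpuUtilizationCache
  def initialize(max_age_seconds: 120)
    @max_age_seconds = max_age_seconds
    @value = nil
    @updated_at = nil
  end

  # Called by the periodic job after it reads the metric.
  def write(percent, now: Time.now)
    @value = percent.to_f
    @updated_at = now
  end

  # Returns nil when there is no reading or the reading is too old.
  # Assumption: if the job stops running, we would rather serve
  # every request than shed based on outdated data.
  def read(now: Time.now)
    return nil if @updated_at.nil?
    return nil if now - @updated_at > @max_age_seconds

    @value
  end
end

# Decides whether a request should be dropped before doing any work.
class RequestShedder
  CRITICALITY_HEADER = "X-Request-Criticality" # hypothetical header name
  SHEDDABLE = "sheddable"                      # hypothetical header value

  def initialize(cache:, threshold_percent: 90.0)
    @cache = cache
    @threshold_percent = threshold_percent
  end

  # Only requests explicitly tagged as sheddable are ever dropped.
  def shed?(headers, now: Time.now)
    return false unless headers[CRITICALITY_HEADER] == SHEDDABLE

    cpu = @cache.read(now: now)
    !cpu.nil? && cpu >= @threshold_percent
  end
end
```

A controller hook would call shed? with the request headers and, when it returns true, render an error response immediately instead of touching the database. Untagged requests are treated as critical by default, so forgetting to tag a request can never cause it to be dropped.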
After that, I added criticality headers to our index polling requests. This turned out to be more difficult than expected since we make all of our API requests using AngularJS resources and this pattern makes it difficult to add headers to requests dynamically. The solution I decided on was to define a new polling action on the resource provider and use this action instead of the default query action whenever a sheddable request was made. The final piece was to test it out!
Fun fact: MySQL is performant enough that overloading my local database was more difficult than originally expected. I eventually achieved it by performing a cyclic redundancy check on a text field of our largest table, cross-joining it with the same check on the same table twice, and performing this check/join multiple times in new threads. It’s a fun change to write code intended for poor performance!
A few months later we decided to check how well the new request shedding logic was working, if at all. First, we checked our logs to see if any requests were being shed:
We were surprised to notice that we had shed over 1500 requests in less than an hour that day! So it was doing something. The next question was whether MySQL was actually overloaded at that time. We checked on its CPU metrics through the AWS console:
And noticed a spiking pattern consistent with the number of requests being shed! The pattern becomes even more apparent when the graphs are overlaid:
From an informal poll of those who have been on call lately, it seems like MySQL overload issues are occurring less frequently, and when they do occur, engineers can spend more of their time actually debugging the issue because the system is taking care of mitigating the effects. Who knew a book club could be so practical?
About the Author:
Leanne Miller is a Software Engineer at Civis Analytics. She is the resident Book Club Tzar and is particularly fond of tea.