At the beginning of February we experienced a database outage that lasted for a couple of days. In this post we will go through the different stages we encountered and our approach to solving those issues now and in the future.
We are deeply sorry for this episode and the interruption in service this has caused and hope we can clear your concerns through this blog but especially our future work.
Database problems protocol
Thursday, January 30
For the first time, an
ActiveRecord::ConnectionTimeoutError popped up in our exception tracking. This exception is raised, if
ActiveRecord, Rails’ ORM library, can’t establish a database connection within a certain timeframe (5 seconds, by default). Usually this indicates that the database connection pool which ActiveRecord manages is too small, so ActiveRecord isn’t able to acquire a connection from the pool, because all of them are already in use.
Our first approach was to increase the connection pool limit and monitor the situation.
Monday, February 3rd and Tuesday, February 4th
At first, it looked like increasing the connection pool had solved the problem. But eventually the exception popped up again. We tweaked the code that was responsible for calculating the connection pool size for each of our Rails processes. We audited our code to find out if some database transactions were neither committed nor rolled back. The exception popped up every hour already and the only way we could temporarily resolve it was by restarting the whole web application. This affected a lot of our users and customers as regular error messages started to show up.
Wednesday, February 5th
We were already using the Heroku integration from Librato and built a special dashboard that focused on our most important database and Heroku dyno metrics to be able to quickly determine any errors. We noticed that database system load was high and the available memory for caching was low. Looking at a live-view of running database queries, we noticed that certain queries seemed to run for minutes. They seemed to be blocking database connections. We decided to upgrade our database to increase memory and hoped that this would solve the connection problem.
Heroku offers a nice feature called “database followers” which lets you provision a new database and tell it to follow another database. This “follower database” will be an identical copy of the original database and all changes to the original database will be available at the follower database, usually within a split second.
It was around 9pm UTC when the follower database was ready. This is the time of the day when we receive most traffic on our application, but we had no other choice than to switch over to the new database. This took around 1 minute of downtime.
We watched our metrics right after the switch: The application seemed much slower than before. This was expected as the database caches still needed to warm up a bit. After 1 hour it was still slow. Additionaly we didn’t receive any database metrics from Heroku. Since this problems didn’t get better we decided to create another follower database at around 12am UTC because the current one seemed flawed.
Thursday, February 6th
After switching to the second follower database at around 2am UTC and monitoring for a few hours everything seemed resolved.
A few hours later, another outage had occurred. The new database didn’t perform as expected. We noticed that certain queries were slower than others, in particular queries that involved certain data records. We performed
ANALYZE TABLE table_name on the tables containing these data records.
ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries. – http://www.postgresql.org/docs/9.3/static/sql-analyze.html
Friday, February 7th
During the early hours of February 7th we received a notification by Heroku that the machine our database was running on had died and they moved the database to a new machine. This didn’t have anything to do with our previous database problems.
During the day we saw long-running queries again involving certain data records. It looked like some records were damaged, so we decided to recreate our database completely. We scheduled a maintenance window for the next day.
Saturday, February 8th
We put our application into maintenance mode and pulled a backup from our database. We created a new database with the next higher size on Heroku and restored the backup. We monitored our database over the next days and didn’t see any more database errors.
Over the last few days we’ve done further load tests to determine if higher load or more long running requests could cause further problems. Those tests aren’t finished, but so far they have been successful in keeping up with much higher load. There are certain optimisations in our codebase we will work on in the future to make sure we reduce the load placed on specific parts of our system or reimplement those infrastructure parts to be able to deal with much higher load.
Our learnings and improvements
We agreed on a number of measures to prevent this from happening again:
We’ve now got a better understanding of problematic areas in our application and how we can fix them.
Better and stricter monitoring is in place that will notify us if any of those or similar errors happen again in the future so we can act immediately.
Our practices to deal with database errors have improved so in case of larger problems we can switch to different databases or maintain it more efficiently and with minimal interruption.
Stronger load testing to put our application under high load will be regular events to determine possible points of failure.
Heroku and it’s Postgres cloud solution have helped us tremendously in the past to bring our product to market and iterate quickly. We still consider them to be the best service for us at this point. To be able to deal with faulty databases or problems in our infrastructure we needed and need to improve our practices and monitoring so we can keep the extremely high quality of service we want to provide to our customers. Especially as we do not have full access to those Postgres instances and need to rely on other data to determine faulty state.
We will keep pushing our tools and workflows that these interruptions aren’t happening in the first place, but if they occur are dealt with swiftly.
We are very sorry that this happened and sincerely hope that through our efforts we can rebuild the trust that might have been lost during this episode. As always we would love to get your feedback or suggestions on how we can improve and what your expectations and worries are.
Ship long and prosper!