We are pleased to announce that our build infrastructure has been upgraded to Ubuntu Trusty. This means that your builds will run in an updated and more stable environment. We worked hard during the past couple of months to make this upgrade as smooth as possible.
We want to apologize for the service outage that began on Thursday, 7/31, at 6:30 PM UTC. We caused you a lot of trouble, and we are really sorry!
After digging into our logs, we reconstructed the series of events:
It started with poor database performance around 6:30 PM UTC, which caused a growing backlog of events in our Sidekiq queues. As the backlog grew, we hit the memory limit of our Redis instance. From that point on, Sidekiq could no longer enqueue new jobs, so those jobs were dropped.
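The failure chain here (work arriving faster than it is processed, until a hard limit is reached) can be illustrated with a toy simulation. This is not a model of Redis or Sidekiq: the function name, the per-tick rates, and the fixed `capacity` standing in for the Redis memory limit are all assumptions made for illustration.

```python
from collections import deque


def simulate_backlog(arrivals_per_tick, processed_per_tick, capacity, ticks):
    """Toy model of a job queue with a hard size limit (hypothetical helper).

    Each tick, `arrivals_per_tick` jobs try to enqueue and up to
    `processed_per_tick` jobs are worked off. Jobs that arrive while the
    queue is full are dropped. Returns (final_backlog, dropped_jobs).
    """
    queue = deque()
    dropped = 0
    for tick in range(ticks):
        # Producers keep enqueueing; anything past the limit is lost.
        for i in range(arrivals_per_tick):
            if len(queue) < capacity:
                queue.append((tick, i))
            else:
                dropped += 1
        # Workers (slowed down by the database) drain part of the queue.
        for _ in range(min(processed_per_tick, len(queue))):
            queue.popleft()
    return len(queue), dropped


print(simulate_backlog(10, 10, 100, 50))  # (0, 0): workers keep up
print(simulate_backlog(10, 4, 100, 50))   # (96, 204): queue hits the limit, jobs are dropped
```

As long as workers keep up, the queue never grows; once they fall behind, the backlog climbs to the limit and every further arrival beyond what the workers free up is lost.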
Some of the major challenges in building infrastructure today are predictability, scalability and automated recovery. A predictable system promotes the exact artifact you tested into production, so differences between builds can't introduce intermittent failures. A scalable system makes it easy, ideally automatic, to handle a rise in traffic. And automated recovery lets your team focus on building a better product and sleep through the night instead of constantly maintaining infrastructure.
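One way to make "promote the exact artifact you tested" concrete is to identify each build by a content hash recorded when the tests pass, and refuse to promote anything whose hash differs. A minimal sketch under that assumption; `artifact_digest`, `promote`, and the dict standing in for a production environment are hypothetical names, not part of any particular tool.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """SHA-256 content hash that identifies a build artifact."""
    return hashlib.sha256(artifact).hexdigest()


def promote(artifact: bytes, tested_digest: str, environment: dict) -> bool:
    """Promote `artifact` into `environment` (a hypothetical stand-in for a
    deploy target) only if it is byte-for-byte the artifact whose digest
    was recorded when the tests passed."""
    digest = artifact_digest(artifact)
    if digest != tested_digest:
        return False  # rebuilt or modified: not what we tested
    environment["current"] = digest
    return True


tested = artifact_digest(b"app build 42")
production = {}
promote(b"app build 42", tested, production)            # True
promote(b"app build 42 (rebuilt)", tested, production)  # False, production unchanged
```

Comparing hashes rather than version labels means that even a rebuild from the same commit, which may differ byte for byte, is rejected until it has been tested itself.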
At Codeship we’ve found that an infrastructure made up of immutable components has helped us tremendously with these goals.
At Codeship we are heavy users of Rollbar, a fantastic service that “collects and analyzes errors on web and mobile apps so you can find and fix them faster”. Rollbar also provides an API to track deployments of your application. This gives you more insight into your deployments without much effort. You can, for example, track errors across multiple deployments and see whether they were resolved or are happening again.
Of course, an ideal (continuous) deployment tool should integrate Rollbar (and every other useful service) directly, without forcing you to use shell commands. But services come and go and development time is limited, so sometimes you have to do it yourself.
Pushing Deployment Information with Rollbar
Here is the command the team at Rollbar suggests to push deployment information to their service.
This should be extremely easy to integrate into almost any deployment process. The Rollbar team also documents integrations with tools such as Capistrano and Fabric.
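If your deploy script is written in Python rather than shell, the same deploy notification can be built with the standard library. This is a minimal sketch, not Rollbar's official client: the field names (`access_token`, `environment`, `revision`, `local_username`, `comment`) and the endpoint follow Rollbar's deploy API as we understand it, and we assume the endpoint accepts URL-encoded form fields like its curl example's multipart fields. Check the current Rollbar API documentation before relying on it.

```python
import urllib.parse
import urllib.request

# Deploy-tracking endpoint of Rollbar's v1 API (verify against current docs).
ROLLBAR_DEPLOY_URL = "https://api.rollbar.com/api/1/deploy/"


def build_deploy_request(access_token, environment, revision,
                         local_username=None, comment=None):
    """Build (but do not send) a POST request announcing a deploy."""
    fields = {
        "access_token": access_token,  # server-side token allowed to post data
        "environment": environment,    # e.g. "production"
        "revision": revision,          # e.g. the full git commit SHA
    }
    if local_username:
        fields["local_username"] = local_username
    if comment:
        fields["comment"] = comment
    # urllib sets Content-Type to application/x-www-form-urlencoded for us.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(ROLLBAR_DEPLOY_URL, data=body, method="POST")


def notify_rollbar(request, timeout=10):
    """Send the request; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Building the request separately from sending it keeps the payload easy to inspect or test; in a pipeline you would read the token from a secret environment variable and call `notify_rollbar` only after the deploy itself succeeded.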
If you integrate this command into your deploy pipeline, you get deploy annotations in your graphs as well as a very well-done integration throughout the rest of the app. For example, Rollbar tries to figure out which deployment is responsible for a given exception.
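The exact way Rollbar attributes exceptions to deployments isn't described here, but a simple heuristic that works with the same deploy data is to blame the most recent deploy before the error was first seen. A sketch of that heuristic only; `deploy_for_error` is a hypothetical helper, and timestamps are plain integers for brevity.

```python
from bisect import bisect_right


def deploy_for_error(deploys, first_seen):
    """Return the revision that was live when an error was first seen.

    `deploys` is a list of (timestamp, revision) pairs sorted by timestamp.
    Returns None if the error predates every recorded deploy.
    """
    timestamps = [ts for ts, _ in deploys]
    # Index of the first deploy strictly after `first_seen`.
    index = bisect_right(timestamps, first_seen)
    if index == 0:
        return None
    return deploys[index - 1][1]


deploys = [(100, "a1"), (200, "b2"), (300, "c3")]
deploy_for_error(deploys, 250)  # "b2": deployed at 200, still live at 250
```

A deploy at exactly the moment the error appears is counted as already live, which is why `bisect_right` is used instead of `bisect_left`.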
How Rollbar helps Codeship
We use Rollbar in all of our services, and of course we also track the deployments of these services. Most of the time we do this with a shell command very similar to the one above, sometimes combined with similar commands for other services. More than once, this integration has helped us drill down into an exception, find the commit that introduced it, and figure out how to fix it.
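When one deploy step announces a deploy to several services, a design question is whether a failing notification should fail the whole step. A minimal sketch of one option, isolating each failure so a flaky third-party API can't block the others; `notify_all` and the notifier names are hypothetical, and each notifier could wrap a request like a Rollbar deploy call.

```python
def notify_all(notifiers, deploy):
    """Call every notifier with the same deploy info.

    `notifiers` maps a service name to a callable taking the deploy dict.
    A failure in one service is recorded but does not stop the others.
    Returns a dict mapping each name to "ok" or "error: <message>".
    """
    results = {}
    for name, notify in notifiers.items():
        try:
            notify(deploy)
            results[name] = "ok"
        except Exception as exc:  # isolate each service's failure
            results[name] = f"error: {exc}"
    return results
```

The caller can then decide, based on the returned results, whether to just log a failed notification or treat it as a failed deploy.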
We want to thank Brian, Cory, and the rest of the Rollbar team for their awesome service. If you are looking to integrate an exception tracking service into your application, make sure to check them out. Are you a Rollbar user? I would love to hear your thoughts in the comments!