Killing Instances with Chaos Monkey

06/28/2011 08:53 pm

To kick off this series, I thought I'd look at how you go about breaking your cloud setup once everything seems to be running nicely. Confidence grows with testing, and in cloud hosting -- and any service-based architecture -- confidence in the service grows every time you see servers come back from the dead.

If you're deploying your first cloud setup, how do you know that your cloud architecture actually works, and how do you know when you've finished? How do you know that your setup is truly resilient rather than just a bit scalable?

If you're considering cloud for dev, test, or production, the first thing you need to ask is: what happens when something fails? It's a common misconception that resilience and scalability are built in simply because you are using the cloud. You don't get a hugely scalable site just by pushing your code to CloudServers or EC2.

Resilience is built in at the hardware level in most cloud offerings. Disks fail, network connections drop out, and through the magic of your cloud provider, you don't notice. But it's your job to assume failure at the levels of device, data-center, region, and internet and to use the cloud APIs to build a system that self-heals.

Challenge every part of your architecture and know what happens if it fails. That's one of the great benefits of anything cloud-based: you can trial new configurations and test scenarios very cheaply.

There are some tools and techniques to help with this.

Chaos Monkey tests your cloud by randomly taking down servers. Run it on your AWS account at your own risk. Read the disclaimer. But do run it so you can confidently say "if everything crashes, it'll be back in 2 minutes with no help from me."
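One way to back up a claim like "back in 2 minutes" is to time the recovery. The following is only a sketch and is not part of Chaos Monkey: time_to_recover and http_check are hypothetical helpers, and the 120-second limit and 5-second polling interval are assumptions. The health check is any lambda that returns true or false, so the timing logic can be tried without a real outage.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: poll a health check until it passes and return the
# number of seconds that took, or nil if the timeout was reached first.
# The clock and sleeper are injectable so the logic can be exercised quickly.
def time_to_recover(check, timeout: 120, interval: 5,
                    clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                    sleeper: ->(seconds) { sleep(seconds) })
  started = clock.call
  loop do
    elapsed = clock.call - started
    return elapsed if check.call      # healthy again
    return nil if elapsed >= timeout  # not back within the limit
    sleeper.call(interval)
  end
end

# Hypothetical health check: true when the URL answers with a 2xx status.
# Any error (refused connection, DNS failure, timeout) counts as "down".
def http_check(url)
  uri = URI(url)
  lambda do
    begin
      Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
    rescue StandardError
      false
    end
  end
end
```

To use it, kill an instance and then call time_to_recover(http_check("http://your-app.example/")) with your own URL in place of the placeholder. A nil result means the app did not come back within the limit.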

Chaos Monkey stands in for quite a lot of everyday failures. Instances run slowly and become unresponsive, or some unusual event happens that chews up all your inodes. Someone might deploy an insane and badly written script that takes down the database.

Once installed, you run Chaos Monkey like this:

> ChaosMonkey -l=output.txt -e=US-East -a=YOUR-ACCESS-KEY -s=YOUR-SECRET-KEY -t=chaos -v=1

What follows is the random killing of instances and the unearthing of all the assumptions in your architecture.

ChaosMonkey kills instances if they have the tag chaos=1, which means you can limit the damage it can cause. If you wanted to script this yourself and do something specific -- say, kill each database server every time it comes back -- then it only takes a few lines to throw something together in Ruby. The following script does some of what ChaosMonkey does.

First install the fog gem with

gem install fog

Then set your credentials in the script and run it... remember, it is designed to break your servers.

require "fog"

# Connect to EC2; fill in your own credentials.
connection = Fog::Compute.new(
  :provider => 'AWS',
  :aws_access_key_id => 'XXXXXXXXX',
  :aws_secret_access_key => 'XXXXXXXXXXXXXXXXXXXXXXXXXXX',
  :region => 'us-east-1'
)

# Collect every instance tagged chaos=1.
candidates = []
connection.servers.all().each do |i|
  if i.tags["chaos"].to_i == 1
    candidates.push i
  end
end
abort "No instances tagged chaos=1." if candidates.empty?

puts "We have #{candidates.size} candidates. We're going to KILL these instances. If stuff breaks, it is not my fault. Ok?.. type 'ok'."
input = $stdin.gets.to_s
if input.strip != 'ok'
  puts "Maybe next time"
  exit
end

# Pick off just one instance and kill it. You can run this as many times as you like to kill more.
kill_instance = candidates.shuffle.first
puts "Killing #{kill_instance.id}"
kill_instance.destroy
puts "Check your app... is it still there?"

Suppose you want to kill only the smaller instances, because you think the new version of your app uses too much memory. Just change the tag check to:

connection.servers.all().each do |i|
  if i.tags["chaos"].to_i == 1 && i.flavor_id == 't1.micro' # Only kill micros
    candidates.push i
  end
end
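If you find yourself changing that condition often, the selection can be pulled into a small function. This is a sketch only: chaos_candidates is a hypothetical helper, and the Instance struct is a stand-in for the three fields (id, tags, flavor_id) that the script reads from a fog server object, so it runs without touching AWS.

```ruby
# Stand-in for the fields the kill script reads from a fog server object.
# (Assumption: only id, tags and flavor_id matter for selection.)
Instance = Struct.new(:id, :tags, :flavor_id)

# Return the instances tagged chaos=1, optionally limited to given flavors.
def chaos_candidates(servers, flavors: nil)
  servers.select do |server|
    tagged = server.tags["chaos"].to_i == 1
    tagged && (flavors.nil? || flavors.include?(server.flavor_id))
  end
end

# Made-up instances for illustration.
servers = [
  Instance.new("i-aaa", { "chaos" => "1" }, "t1.micro"),
  Instance.new("i-bbb", { "chaos" => "1" }, "m1.large"),
  Instance.new("i-ccc", {}, "t1.micro")
]

micros = chaos_candidates(servers, flavors: ["t1.micro"])
puts micros.map(&:id).inspect
```

In the real script you would pass connection.servers.all() instead of the made-up list.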

You know your architecture is good if you can kill instances, wait a moment, and everything is back as if by magic. New instances must start up and configure themselves, and the app must be there as if nothing happened.

If you don't feel comfortable unleashing the monkey on your setup, you should at least kill servers on a regular basis, even if you do this out of hours when no-one is looking.
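If you schedule the kill script (from cron, say), a guard like the one below keeps it to quiet periods. It is a sketch under stated assumptions: out_of_hours? is a hypothetical helper, and "out of hours" is assumed to mean weekends plus weekdays before 07:00 or from 20:00 onwards. Adjust both to suit your own traffic.

```ruby
# Hypothetical guard: is this a time when nobody is looking?
# Assumed quiet period: weekends, or weekdays before quiet_end / from quiet_start.
def out_of_hours?(time, quiet_start: 20, quiet_end: 7)
  return true if time.saturday? || time.sunday?
  time.hour >= quiet_start || time.hour < quiet_end
end

# At the top of a scheduled kill script you might write:
#   exit unless out_of_hours?(Time.now)
```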

Let's walk through a fairly basic AWS setup. Suppose it has RDS (Relational Database Service) providing MySQL. You have a bank of EC2 instances running your app, which sits behind an elastic load balancer (ELB). That's sitting behind another EC2 instance acting as a caching server.

Can you spot the points of failure? All the EC2 instances, the RDS instance, and the load balancer. The EBS virtual disks can fail as well, as anyone who experienced the AWS outage will know.
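Writing the setup down as data makes the weakest spots easy to list. The sketch below is illustrative only: the component names follow the example above, the number of independent copies of each is an assumption, and single_points_of_failure is a hypothetical helper.

```ruby
# The example setup, with how many independent copies of each component
# are running. The counts are assumptions for illustration.
ARCHITECTURE = {
  "caching server (EC2)" => 1,
  "load balancer (ELB)"  => 1,
  "app servers (EC2)"    => 3,
  "database (RDS)"       => 1
}

# Anything with fewer than two copies takes the whole service down with it.
def single_points_of_failure(components)
  components.select { |_name, copies| copies < 2 }.keys
end

puts single_points_of_failure(ARCHITECTURE).inspect
```

Having three app servers does not make them safe. It only means the rest must be able to carry the load when one dies, which is exactly what killing instances tests.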

When any EC2 instance goes down, you want the rest to pick up the load and the cloud config to automatically repair itself. By killing instances in your cloud setup, and by shutting down services on the instances, you can simulate lots of things going wrong. You'll find some nasty condition that causes the servers to lock up and so you fix it... with APIs.

The first rule of cloud architecture is to allow for failure. You can allow for it because every part of the architecture is programmatically managed: there's an API for everything. When something fails, which APIs can you use to fix your cloud? If you can't fix it with APIs, you need to rework your architecture.
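At its simplest, fixing it with APIs means comparing what should be running with what is running and launching the difference. The sketch below shows only that comparison. heal is a hypothetical helper, and the launcher is whatever API call starts one replacement. With fog that might be a lambda wrapping connection.servers.create with your own image and flavor, which is an assumption about your setup rather than something prescribed here.

```ruby
# Hypothetical reconcile step: launch enough replacements to get back to
# the desired number of instances. Returns whatever the launcher returned
# for each new instance (an empty array if nothing was missing).
def heal(desired_count, running, launcher)
  missing = [desired_count - running.size, 0].max
  missing.times.map { launcher.call }
end

# Illustration with a fake launcher, so nothing is started for real.
started = []
fake_launcher = lambda do
  id = "i-new#{started.size}"
  started << id
  id
end

replacements = heal(3, ["i-aaa"], fake_launcher)
puts "Launched #{replacements.size} replacement(s): #{replacements.join(', ')}"
```

Run something like this on a schedule and a killed instance is replaced without anyone stepping in.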

Next time we'll look at some of the magic configuration tools that make this kind of self-repair possible.

Dan Frost is Technical Director of, cloud hosting consultants and web developers based in London and Brighton, UK

Dan has been building cloud hosting, writing, and talking about the cloud since before it was trendy. Since he spun up his first AWS instance, he's been trying out new services and finding ways of getting more out of hardware without actually owning any of it.
