Welcome to the Jungle: Netflix Chaos Monkey Goes Open Source

Posted in Industry – 30 July 2012
Uptime, uptime, uptime. It’s the modern mantra that drives IT in the Era of the Outage. Now Netflix has let loose Chaos Monkey, a potent new tool for ensuring your applications can stay afloat after losing one or more instances. The basic idea is that Chaos Monkey works between 9am and 3pm, randomly killing roughly one instance in an AWS Auto Scaling group each week. Developers and operators can then work together during regular business hours to recover from the micro-outage and to design systems that better withstand such random failures in the future.
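To make the schedule concrete, here is a minimal Python sketch of the timing rule described above. It is not Chaos Monkey's actual code. The 9am and 3pm bounds come from the description; the weekday-only rule, the one-check-per-weekday cadence, and every name in the block (`in_window`, `should_terminate`, `KILL_PROBABILITY`) are assumptions made for illustration.

```python
import random
from datetime import datetime

OPEN_HOUR = 9    # 9am, as described in the article
CLOSE_HOUR = 15  # 3pm, as described in the article

# Assumption: the monkey checks each group once per weekday, so a
# per-check probability of 1/5 averages out to about one kill per week.
RUNS_PER_WEEK = 5
KILL_PROBABILITY = 1 / RUNS_PER_WEEK


def in_window(now: datetime) -> bool:
    """True on Monday to Friday from 9am (inclusive) to 3pm (exclusive)."""
    return now.weekday() < 5 and OPEN_HOUR <= now.hour < CLOSE_HOUR


def should_terminate(now: datetime, rng: random.Random) -> bool:
    """Decide whether this check should kill an instance."""
    return in_window(now) and rng.random() < KILL_PROBABILITY


if __name__ == "__main__":
    print(should_terminate(datetime.now(), random.Random()))
```

Confining kills to a daytime window is what lets the people who must respond do so while they are at their desks.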

Not Curious George
Chaos Monkey is coming for your cloud instances. (Photo credit: jcoterhals)

Netflix offers further details on its blog:

“The default instance groupings that Chaos uses for selection is Amazon’s Auto Scaling Group (ASG). Within an ASG, Chaos Monkey will select an instance at random and terminate it. The ASG should detect the instance termination and automatically bring up a new, identically configured, instance. If you are not using Auto Scaling Groups that should be the first step to making your application handle these isolated instance failure scenarios.”
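The select, terminate, and replace cycle Netflix describes can be pictured with a small Python simulation. This is only a sketch: `SimulatedGroup` is a hypothetical in-memory stand-in for an Auto Scaling Group, not an AWS or Chaos Monkey API, and the instance names are invented.

```python
import random
from itertools import count


class SimulatedGroup:
    """Toy stand-in for an Auto Scaling Group (hypothetical, not an AWS API)."""

    def __init__(self, name, desired_size):
        self.name = name
        self.desired_size = desired_size
        self._ids = count(1)
        self.instances = []
        self.reconcile()

    def reconcile(self):
        """Launch replacements until the group is back at its desired size."""
        while len(self.instances) < self.desired_size:
            self.instances.append(f"{self.name}-i{next(self._ids)}")


def terminate_random_instance(group, rng):
    """Pick one instance uniformly at random and remove it from the group."""
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim


if __name__ == "__main__":
    group = SimulatedGroup("web", 3)
    victim = terminate_random_instance(group, random.Random())
    print("terminated:", victim, "remaining:", group.instances)
    group.reconcile()  # the group notices the loss and launches a replacement
    print("after replacement:", group.instances)
```

In the simulation the group heals itself only because `reconcile` runs after the kill, which mirrors the quote's point that Auto Scaling Groups are the first step toward surviving isolated instance failures.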

To date, Chaos Monkey has been optimized for use on Amazon Web Services, but its creators claim it is flexible enough to run on any other cloud with a little extra coding. They also say to expect more of the Simian Army to go open source in the coming months, including Chaos Gorilla, the BFG9000 of outages, designed to simulate the loss of an entire AWS Availability Zone.

