Implement chaos engineering principles to find system failures


Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.

The software industry is moving to distributed computing for advantages such as reliability, scalability, and high performance. Microservice architecture is one of the architectures built on top of distributed computing. Microservices need to communicate with multiple services, synchronously or asynchronously. The software testing team usually tests the components individually, but once the product moves to production it also needs testing under real load. Microservices give more flexibility for developing and deploying the application in production. However, a service can fail because of an internal fault or because a service it depends on has failed, and its dependents may then stop working properly. The product design must anticipate such failures and prepare each component to keep working according to the availability of its dependencies.

Chaos engineering principles help build confidence before a system moves to production. They help test the system against unpredictable failures in a systematic way. The principles were introduced by Netflix. A system built by developers and QA can fail because of application, network, infrastructure, or dependency failures. The Netflix tools team created ‘Chaos Monkey’, which randomly destroys software components to reveal how the system behaves. Chaos engineering can be explained in the following steps.

  • Steady state: define how the system behaves normally in terms of measurable output metrics, for example throughput, error rates, and latency.
  • Hypothesize: assume the steady state will continue in both the control group and the experimental group.
  • Run experiment: introduce real-world incidents into the experimental group, for example server crashes, application crashes, malformed responses, or traffic spikes.
  • Improve: test the hypothesis by comparing the steady state of the control group and the experimental group.
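
The comparison in the last step can be sketched in Java. This is a minimal illustration, not part of any chaos engineering tool: the class SteadyStateCheck, its Sample type, and the two tolerance parameters are hypothetical names chosen for this example. It assumes the steady state is measured by error rate and mean latency.

```java
import java.util.List;

/** Hypothetical helper: compares steady-state metrics of two groups. */
final class SteadyStateCheck {

    /** One observed request: whether it failed and how long it took. */
    static final class Sample {
        final boolean failed;
        final long latencyMillis;

        Sample(boolean failed, long latencyMillis) {
            this.failed = failed;
            this.latencyMillis = latencyMillis;
        }
    }

    /** Fraction of failed requests in a group (0.0 for an empty group). */
    static double errorRate(List<Sample> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long failures = samples.stream().filter(s -> s.failed).count();
        return (double) failures / samples.size();
    }

    /** Mean latency in milliseconds (0.0 for an empty group). */
    static double meanLatency(List<Sample> samples) {
        return samples.stream().mapToLong(s -> s.latencyMillis).average().orElse(0.0);
    }

    /**
     * The hypothesis holds when the experimental group is no worse than the
     * control group by more than the allowed deviation, for both metrics.
     */
    static boolean hypothesisHolds(List<Sample> control, List<Sample> experiment,
                                   double maxErrorRateDelta, double maxLatencyDeltaMillis) {
        double errorDelta = errorRate(experiment) - errorRate(control);
        double latencyDelta = meanLatency(experiment) - meanLatency(control);
        return errorDelta <= maxErrorRateDelta && latencyDelta <= maxLatencyDeltaMillis;
    }
}
```

A real experiment would read these samples from the monitoring system. The tolerances are a design choice that depends on how much degradation the team is willing to accept.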

When weaknesses are found in the product, the team addresses and fixes them. The process is then repeated to keep confirming the product's behavior.

Chaos engineering helps prevent system failures by testing products in a destructive environment. The product may fail due to network glitches, hard disk failures, overloading of a functional component, application crashes, and so on. We may not be able to prevent or test everything, but resilience testing helps reduce such failures. Chaos Monkey is a tool that randomly disables production instances to make sure the system can survive this common type of failure without any customer impact. After the success of Chaos Monkey, Netflix developed many more tools to identify and test for other kinds of failure.

  • Latency Monkey
  • Conformity Monkey
  • Doctor Monkey
  • Janitor Monkey
  • Security Monkey
  • 10–18 Monkey
  • Chaos Gorilla

More details about these tools are available on the Netflix technical blog.

Failure in a distributed system is unavoidable. The product should be designed to support chaos engineering principles. Chaos engineering expects failures to be injected in a systematic way and the components to provide resilience mechanisms.

Chaos engineering can be used to achieve resilience against:

  • Infrastructure failures
  • Network failures
  • Application failures

Simple Use case

The product has two services that communicate with each other. The user accesses the web application, and the web application communicates with Service A. Service A communicates with Service B and a database. Chaos engineering helps answer questions such as: how does the system behave when Service A goes down? When Service B goes down? What does the user experience? We might design and code the application to handle such cases, but we should validate the system before moving it to production.
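
One way to handle the "Service B goes down" case is for Service A to return a degraded response instead of failing. The following Java sketch is only an illustration under assumptions: the class ServiceA, the Supplier standing in for the remote call to Service B, and the fallback value are all hypothetical and not taken from a real project.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of Service A degrading gracefully when Service B fails. */
final class ServiceA {
    private final Supplier<String> serviceB; // stands in for a remote call to Service B
    private final String fallback;           // response used when Service B is unavailable

    ServiceA(Supplier<String> serviceB, String fallback) {
        this.serviceB = serviceB;
        this.fallback = fallback;
    }

    /** Returns Service B's answer, or the fallback if the call throws. */
    String handleRequest() {
        try {
            return "A+" + serviceB.get();
        } catch (RuntimeException e) {
            // Service B is down or misbehaving: answer with a degraded response
            return "A+" + fallback;
        }
    }
}
```

A chaos experiment would then inject the failure, for example by passing a supplier that throws, and check that the user still receives the fallback response.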

Game day

A Game Day is a dedicated day focused on using chaos engineering to reveal weaknesses in the product. The team attacks the system, either manually or with automated scripts, to disrupt it. The focus is on building more resilient systems by breaking things on purpose. A game day may be planned or unplanned. A planned game day is prepared well in advance and involves all the required members, including DevOps, development, QA, and the management members needed to approve the system. The product may involve multiple components, for example the application (services), infrastructure, database, and other dependencies. Each area can be tested by separate team members or by automated scripts or tools. The main goal is to disturb the current system and fix the issues before they occur for real in production. The team members stay connected through chat, a conference call, or in person in a conference room so they can communicate easily and fix the issues.


Netflix designed Chaos Monkey to test system stability. Many open-source and commercial tools are now available for testing product resilience.

Chaos Monkey

According to the Chaos Monkey documentation, Chaos Monkey is responsible for randomly terminating instances in production to ensure that engineers implement their services to be resilient to instance failures. Chaos Monkey requires Spinnaker and MySQL. Spinnaker is a free and open-source continuous delivery platform, originally developed by Netflix, that helps manage applications and deployments. MySQL is required to store the daily termination schedule and to enforce a minimum time between terminations. Chaos Monkey has no recovery tools and no user interface.
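
The idea of random termination with a minimum time between terminations can be sketched as follows. This is not Chaos Monkey's actual implementation: the class TerminationPicker and its method are hypothetical, and the last termination time is kept in memory here, whereas Chaos Monkey stores its schedule in MySQL.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Hypothetical sketch of picking a random instance to terminate. */
final class TerminationPicker {
    private final Random random;
    private final Duration minGap;   // minimum time between two terminations
    private Instant lastTermination; // null until the first termination

    TerminationPicker(Random random, Duration minGap) {
        this.random = random;
        this.minGap = minGap;
    }

    /**
     * Picks a random victim from the group. Returns empty when the group is
     * empty or when the minimum gap since the last termination has not elapsed.
     */
    Optional<String> pickVictim(List<String> instanceIds, Instant now) {
        if (instanceIds.isEmpty()) {
            return Optional.empty();
        }
        if (lastTermination != null
                && Duration.between(lastTermination, now).compareTo(minGap) < 0) {
            return Optional.empty();
        }
        lastTermination = now;
        return Optional.of(instanceIds.get(random.nextInt(instanceIds.size())));
    }
}
```

Passing the Random and the current time as arguments keeps the sketch deterministic in tests. The actual termination call to the cloud provider is deliberately left out.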

Chaos Toolkit

The Chaos Toolkit aims to be the simplest and easiest way to explore building your own chaos engineering experiments. It is built on Python, and users install additional modules according to their needs. Chaos Toolkit drivers extend the toolkit so that it can cause chaos in, and probe, different types of systems. Drivers are available for applications, networks, and infrastructure. Users can develop their own drivers and publish them. The Chaos Toolkit also provides a Docker base image and can be used to run experiments from Docker and against Kubernetes. The O’Reilly book Learning Chaos Engineering by Russ Miles walks through examples that use the Chaos Toolkit.

Chaos Monkey for Spring Boot

The Chaos Monkey for Spring Boot project helps to fail Spring Boot services, REST controllers, controllers, repositories, and components. A watcher is a Chaos Monkey for Spring Boot component that scans your app for beans with a specific type of annotation. After adding Chaos Monkey to the project, the user enables it with a Spring Boot profile. When the profile is active, Chaos Monkey randomly attacks the watched beans. The user must add the Chaos Monkey dependency to the pom.xml file.

Once the application starts with the Chaos Monkey profile, it fails randomly. Make sure the Chaos Monkey profile is not activated in production by mistake. Chaos Monkey exposes multiple HTTP endpoints for enabling it and for querying its properties.


<!-- pom.xml: Chaos Monkey for Spring Boot dependency, added inside <dependencies> -->
<dependency>
	<groupId>de.codecentric</groupId>
	<artifactId>chaos-monkey-spring-boot</artifactId>
	<!-- add a <version> that matches your Spring Boot release -->
</dependency>

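#chaos monkey for spring boot props
# Property names follow the Chaos Monkey for Spring Boot documentation; the values below are illustrative assumptions
# include all endpoints
management.endpoints.web.exposure.include=*
#Determine whether should execute or not
chaos.monkey.enabled=true
#How many requests are to be attacked. 1: attack each request; 5: each 5th request is attacked
chaos.monkey.assaults.level=5
#Minimum latency in ms added to the request
chaos.monkey.assaults.latencyRangeStart=1000
#Maximum latency in ms added to the request
chaos.monkey.assaults.latencyRangeEnd=3000
#Latency assault active
chaos.monkey.assaults.latencyActive=true
#Exception assault active
chaos.monkey.assaults.exceptionsActive=false
#AppKiller assault active
chaos.monkey.assaults.killApplicationActive=false
#Controller watcher active
chaos.monkey.watcher.controller=false
#RestController watcher active
chaos.monkey.watcher.restController=true
#Service watcher active
chaos.monkey.watcher.service=false
#Repository watcher active
chaos.monkey.watcher.repository=false
#Component watcher active
chaos.monkey.watcher.component=false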
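
To make the attack level and the latency range concrete, the following Java sketch delays every Nth call by a random amount between a minimum and a maximum. It is a simplified model of the behavior described in the comments above, not the library's code. The class LatencyAssault and its methods are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical sketch of a latency assault applied to every Nth request. */
final class LatencyAssault {
    private final int level;         // 1 = every request, 5 = every 5th request
    private final long minLatencyMs; // minimum delay added to an attacked request
    private final long maxLatencyMs; // maximum delay added to an attacked request
    private final AtomicLong requestCount = new AtomicLong();

    LatencyAssault(int level, long minLatencyMs, long maxLatencyMs) {
        if (level < 1 || minLatencyMs < 0 || maxLatencyMs < minLatencyMs) {
            throw new IllegalArgumentException("invalid assault configuration");
        }
        this.level = level;
        this.minLatencyMs = minLatencyMs;
        this.maxLatencyMs = maxLatencyMs;
    }

    /** True when the request with this 1-based sequence number is attacked. */
    boolean shouldAttack(long requestNumber) {
        return requestNumber % level == 0;
    }

    /** Runs the call, first sleeping for a random delay if this request is selected. */
    <T> T call(Supplier<T> target) {
        if (shouldAttack(requestCount.incrementAndGet())) {
            long delay = ThreadLocalRandom.current().nextLong(minLatencyMs, maxLatencyMs + 1);
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
            }
        }
        return target.get();
    }
}
```

For example, new LatencyAssault(5, 1000, 3000) would slow down every fifth call by one to three seconds while the other four pass through untouched.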

Simple Java

package com.careerdrill.learning;

import java.util.HashMap;
import java.util.Map;

import javax.json.Json;
import javax.json.JsonBuilderFactory;
import javax.json.JsonObject;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// REST controller that the RestController watcher can attack
@RestController
public class SimpleController {

	// The "/names" path is an assumption; the original mapping was not preserved
	@GetMapping("/names")
	public JsonObject getNames() {
		// Provider-specific factory configuration; the key is empty in the original
		Map<String, Object> config = new HashMap<String, Object>();
		config.put("", Boolean.valueOf(true));
		JsonBuilderFactory factory = Json.createBuilderFactory(config);
		// Build a small JSON object to return to the caller
		JsonObject value = factory.createObjectBuilder()
				.add("firstName", "John")
				.add("lastName", "Smith")
				.add("age", 25).build();
		return value;
	}
}
