How Can Chaos Monkey Help Us With Developing Reliable Systems?

Nowadays, software companies put a lot of effort into ensuring the reliability of their systems and services. Functional and non-functional testing is an integral part of the entire software development process. Resilience testing belongs to the second group and checks that applications perform well under real-life conditions.

One way of improving the resilience of software is to host it on cloud servers. However, this is not enough, because system failures still occur. The conclusion is that the best defense against unexpected failures is to fail often. We can adopt this approach by using Chaos Monkey, which follows the Principles of Chaos Engineering.

Why should we use it? Chaos Monkey helps us verify that our fallbacks are properly defined and that network latency and service breakdowns do not harm our system. We should first run Chaos Monkey in our staging environment and monitor how the system behaves. It is a good idea to simulate high traffic with load tests at the same time. If the metrics confirm that we can control the chaos, then nothing stops us from running Chaos Monkey in the production environment as well. Such experiments give us more confidence that the system will cope with real failures in the future.

In this article we will focus on Chaos Monkey for Spring Boot (CM4SB), a dedicated implementation of Chaos Monkey for Spring Boot applications that is maintained by Codecentric.

Let’s meet the monkeys that make chaos

Chaos Monkey consists of Watchers and Assaults. The diagram shows how they depend on each other.

A Watcher is a Spring component that sets up an aspect for applying additional logic. There is a dedicated watcher for each of the supported Spring annotations. It can carry out assaults on all public methods in the corresponding classes.

Supported Spring annotations:

@Controller
@RestController
@Service
@Repository
@Component

An Assault is one of the available attack types conducted by the watcher. The following assaults are possible:

Latency Assault
Exception Assault
AppKiller Assault
Memory Assault

How does it work?

Chaos Monkey is initialised by activating the chaos-monkey application profile. Spring Boot then loads the Chaos Monkey configuration and creates the required beans. Public classes are scanned to find those that carry one of the supported annotations, and the watchers can then carry out assaults on them. The set of components exposed to attack can be limited with a configuration property that explicitly lists the classes we want to attack. There are 4 possible types of attacks. More than one can be active at the same time, although some configuration properties are shared between them. The process of making chaos can be monitored through dedicated metrics.

Integration

We can integrate Chaos Monkey into our Spring Boot application in a simple way. No code changes are needed. The only things we should do are:

add chaos-monkey-spring-boot dependency:

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>2.2.0</version>
</dependency>

initialise Chaos Monkey by activating spring profile chaos-monkey:

spring:
  profiles:
    active: chaos-monkey

There is only one requirement: our project must contain a Spring web dependency (spring-boot-starter-web or spring-boot-starter-webflux).

Chaos Monkey is disabled by default, so we can keep calm. No unexpected behaviour will surprise us until we enable it in the configuration.

Configuration

Chaos Monkey gives us the possibility to set the configuration in two different ways:

statically in configuration file (application.yml or application.properties),
at runtime through exposed endpoint /actuator/chaosmonkey.

Both can be mixed. We can set all properties provided by CM4SB statically in the configuration file, but this is a good approach only in small, individual cases. Configuration at runtime, on the other hand, has one limitation: we are unable to change which watchers are active.

The best approach is to define watchers statically and all other properties later at runtime. Furthermore, Chaos Monkey should not be enabled in the configuration file (just leave it as default – disabled).

The minimal required configuration defines which Spring components should be attacked by Chaos Monkey. By default, only the service watcher is enabled. If we do not want our service components to be attacked, we have to disable it explicitly. Setting flags for the other watchers is not required. In the configuration below, we set them explicitly just to show the available parameters.

chaos:
  monkey:
    assaults:
      latencyActive: false
    watcher:
      restController: true
      controller: false
      service: false # true by default
      repository: false
      component: false

It is possible to limit Chaos Monkey’s area of destruction. To do so, we can set the watchedCustomServices property and list only the services that we want to attack. It is a list of public class and method names. The matching watcher must be enabled, otherwise the listed components will not be reachable.

"watchedCustomServices": [
    "com.softwarehut.office.controller.OfficeController.open",
    "com.softwarehut.office.controller.OfficeController.close",
    "com.softwarehut.office.controller.CompanyController"
]
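To make the idea of such a list more concrete, here is a minimal Java sketch of how its entries could be interpreted. It is not the library's implementation. The class name WatchedServiceFilter and the matching rule (an entry matches either a whole class or a single "class.method", and an empty list means no restriction) are our assumptions for illustration only.

```java
import java.util.List;

/** Illustrative only: decides whether a method falls inside a watched-services list. */
final class WatchedServiceFilter {

    private final List<String> watched;

    WatchedServiceFilter(List<String> watched) {
        this.watched = List.copyOf(watched);
    }

    /**
     * Assumption: an empty list means "no restriction"; otherwise an entry must equal
     * the fully qualified class name or "className.methodName".
     */
    boolean isWatched(String className, String methodName) {
        if (watched.isEmpty()) {
            return true;
        }
        return watched.contains(className)
                || watched.contains(className + "." + methodName);
    }
}
```

With the list shown above, OfficeController.open and every method of CompanyController would be candidates for an attack, while any other OfficeController method would be left alone.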

Runtime configuration

Chaos Monkey offers built-in endpoints, exposed via HTTP or JMX, that allow us to change the configuration at runtime. We chose HTTP. To enable it, a configuration similar to the following is required:

management:
  endpoint:
    chaosmonkey:
      enabled: true
  endpoints:
    web:
      exposure:
        include: chaosmonkey

The exposed endpoints allow us to:

get current Chaos Monkey configuration

GET: /chaosmonkey

check status whether Chaos Monkey is enabled or disabled,

GET: /chaosmonkey/status

enable/disable Chaos Monkey,

POST: /chaosmonkey/enable
POST: /chaosmonkey/disable

get statuses of watchers,

GET: /chaosmonkey/watchers

get and modify configuration of assaults

GET: /chaosmonkey/assaults
POST: /chaosmonkey/assaults
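The endpoints listed above can be called from any HTTP client. As a rough sketch, the Java 11 HttpClient could be used as follows. The helper class ChaosMonkeyRequests is hypothetical, and the base URL assumes an application running locally on port 8080 with the endpoints mounted under /actuator/chaosmonkey, as mentioned earlier in this article.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative helper that builds requests against the Chaos Monkey endpoints. */
final class ChaosMonkeyRequests {

    // Assumption: local application on port 8080, default actuator base path.
    static final String BASE_URL = "http://localhost:8080/actuator/chaosmonkey";

    private ChaosMonkeyRequests() {
    }

    /** POST /chaosmonkey/enable with an empty body. */
    static HttpRequest enable() {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/enable"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** POST /chaosmonkey/assaults with a JSON body describing the assault settings. */
    static HttpRequest updateAssaults(String jsonBody) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/assaults"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** Sends the request and returns the HTTP status code. */
    static int send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling ChaosMonkeyRequests.send(ChaosMonkeyRequests.enable()) against a running application would then switch Chaos Monkey on; the same pattern applies to the other endpoints.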

Carry out the assaults

Chaos Monkey for Spring Boot allows us to conduct 4 different types of attacks. They can be grouped by application context or type of activation.

In the first group, we will place Latency Assault and Exception Assault. Both of them depend on HTTP requests. To set the frequency of occurrence we specify the level. For example, level=3 determines that every third application request will be attacked by Chaos Monkey.

"level": 3,
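The "every third request" behaviour described above can be pictured as a simple counter. The class below is only our sketch of that idea; EveryNthRequest is a hypothetical name, and the library itself may select requests differently (for example randomly, with a probability of 1/level).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: marks every level-th call as one that should be attacked. */
final class EveryNthRequest {

    private final int level;
    private final AtomicLong counter = new AtomicLong();

    EveryNthRequest(int level) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        this.level = level;
    }

    /** Returns true on calls number level, 2 * level, 3 * level, and so on. */
    boolean shouldAttack() {
        return counter.incrementAndGet() % level == 0;
    }
}
```

With level=3 the first two calls return false and the third returns true, after which the pattern repeats. With level=1 every request is attacked.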

The second group consists of Memory Assault and AppKiller Assault. They do not depend on requests. To activate them, we can use a scheduler.

A suitable cron expression can be set as the value of the runtimeAssaultCronExpression configuration parameter.

"runtimeAssaultCronExpression": "*/1 * * * * ?"

Latency Assault

The latency range is defined by 2 parameters: latencyRangeStart and latencyRangeEnd. Their values are validated and must be at least 1 and no greater than java.lang.Integer.MAX_VALUE. The time unit is milliseconds [ms].

POST: /chaosmonkey/assaults
{
    "level": 3,
    "latencyActive": true,
    "latencyRangeStart": 2000,
    "latencyRangeEnd": 9000,
    # disable other assaults
    "exceptionsActive": false,
    "killApplicationActive": false,
    "restartApplicationActive": false
}
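To illustrate what the two range parameters mean, the sketch below validates a range and draws a random delay from it. It is a simplified model written for this article, not the library's code: the class LatencyRange is hypothetical, and the rule that the start must not exceed the end is our assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative only: a validated latency range that yields random delays in milliseconds. */
final class LatencyRange {

    private final int startMillis;
    private final int endMillis;

    LatencyRange(int startMillis, int endMillis) {
        // Values below 1 are rejected; int already caps values at Integer.MAX_VALUE.
        if (startMillis < 1 || endMillis < 1) {
            throw new IllegalArgumentException("latency bounds must be >= 1 ms");
        }
        // Assumption: the start of the range must not exceed its end.
        if (startMillis > endMillis) {
            throw new IllegalArgumentException("latencyRangeStart must not exceed latencyRangeEnd");
        }
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    /** Returns a delay between start and end, both inclusive. */
    int nextDelayMillis() {
        // The upper bound of nextLong is exclusive; long arithmetic avoids overflow at MAX_VALUE.
        return (int) ThreadLocalRandom.current().nextLong(startMillis, (long) endMillis + 1);
    }
}
```

For the payload above, new LatencyRange(2000, 9000).nextDelayMillis() would return a delay between 2 and 9 seconds for each attacked request.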

Exception Assault

Chaos Monkey allows us to set a custom exception that will be thrown. To use any exception, we pass its fully qualified class name as type and the arguments of its constructor in the arguments array.

POST: /chaosmonkey/assaults
{
    "level": 3,
    "exceptionsActive": true,
    "exception": {
        "type": "java.lang.RuntimeException",
        "arguments": [
            {
                "className": "java.lang.String",
                "value": "Exception assault has been carried out"
            }
        ]
    },
    # disable other assaults
    "latencyActive": false,
    "killApplicationActive": false,
    "restartApplicationActive": false
}
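A configuration like this maps naturally onto Java reflection: the type selects the exception class and the arguments select a constructor. The following sketch shows that idea under our own assumptions. AssaultExceptionFactory is a hypothetical helper, and the way the library actually builds the exception may differ.

```java
import java.lang.reflect.Constructor;

/** Illustrative only: builds an exception from a class name and constructor arguments. */
final class AssaultExceptionFactory {

    private AssaultExceptionFactory() {
    }

    /**
     * @param type               fully qualified exception class name
     * @param argumentClassNames fully qualified class names of the constructor parameters
     * @param argumentValues     values passed to the constructor, in the same order
     */
    static Exception create(String type, String[] argumentClassNames, Object[] argumentValues) {
        try {
            Class<?>[] parameterTypes = new Class<?>[argumentClassNames.length];
            for (int i = 0; i < argumentClassNames.length; i++) {
                parameterTypes[i] = Class.forName(argumentClassNames[i]);
            }
            Class<? extends Exception> exceptionClass =
                    Class.forName(type).asSubclass(Exception.class);
            Constructor<? extends Exception> constructor =
                    exceptionClass.getConstructor(parameterTypes);
            return constructor.newInstance(argumentValues);
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Unknown class, missing constructor, or a type that is not an exception.
            throw new IllegalArgumentException("Cannot build exception of type " + type, e);
        }
    }
}
```

For the payload above, the call would be AssaultExceptionFactory.create("java.lang.RuntimeException", new String[] {"java.lang.String"}, new Object[] {"Exception assault has been carried out"}), which yields a RuntimeException carrying that message.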

Memory Assault

It attacks the memory of the Java Virtual Machine. For more configuration details take a look at the official documentation.

POST: /chaosmonkey/assaults
{
    "memoryActive": true,
    "memoryMillisecondsHoldFilledMemory": 90000,
    "memoryMillisecondsWaitNextIncrease": 100,
    "memoryFillIncrementFraction": 0.90,
    "memoryFillTargetFraction": 0.95,
    "runtimeAssaultCronExpression": "*/1 * * * * ?",
    # disable other assaults
    "latencyActive": false,
    "exceptionsActive": false,
    "killApplicationActive": false
}

AppKiller Assault

This one is simple: it just shuts the application down according to the scheduler. It runs only when a scheduler is defined, that is, when a cron expression is set in the runtimeAssaultCronExpression configuration parameter.

POST: /chaosmonkey/assaults
{
    "killApplicationActive": true,
    "runtimeAssaultCronExpression": "*/1 * * * * ?",
    # disable other assaults
    "latencyActive": false,
    "exceptionsActive": false,
    "memoryActive": false
}

Monitoring

Chaos Monkey for Spring Boot has no built-in report generation. Instead, it provides metrics that we can visualize with independent tools. We used:

Micrometer for exposing metrics in a Prometheus format,
Prometheus for gathering metrics,
Grafana for visualizing the application state while assaults are carried out.

Testing

To observe Chaos Monkey at work, we generated some traffic with Gatling. It allows us to simulate multiple users making requests to our application at the same time. Every Gatling test scenario looks as follows:

enable Chaos Monkey,
enable and configure assault type,
make a bunch of requests to the application’s API endpoints,
disable Chaos Monkey.

Micrometer

To expose suitable metrics, we used micrometer-registry-prometheus and spring-boot-starter-actuator. Their versions are inherited indirectly from Spring Boot Starter Parent.

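<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>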

Prometheus

A Prometheus instance gathers metrics from all microservices that integrate the Chaos Monkey library. It is used as a data source in Grafana.

Grafana

We used the Grafana dashboard with id=9845 and made some improvements to it: more readable labels and dedicated panels for Chaos Monkey watchers.

More configuration details and examples can be found over on GitHub.

Conclusion

Chaos Monkey for Spring Boot is a mature tool for conducting resilience tests. Its main advantage is how simple it is to apply to existing systems. It is dedicated to Spring Boot applications, needs no code modifications, and a brief configuration is enough. All other settings can be changed at runtime, which makes it highly flexible.

Chaos Monkey provides 4 different assault types that can be carried out at any time during the runtime. The process of interfering is recorded by using exposed metrics. There is no built-in reporting functionality. However, those metrics can be consumed and visualised in external tools.
