Open-Sourcing Cthulhu for Chaos Engineering

In a previous article, I wrote about Chaos Engineering, and how this approach to microservice design helps us at xMatters build software that’s resilient to the chaotic nature of cloud computing. In that article, I mentioned a tool that we built in-house named Cthulhu that we use to test the resiliency of our service.

Cthulhu for all!
Well, I’m thrilled to announce that we open-sourced Cthulhu in May, and it’s now available to all on GitHub. With this release, we hope to make it easier for organizations to practice Chaos Engineering and design resilient digital services.

Why other chaos engineering tools did not work for us
When I first researched tools to test our infrastructure, the usual suspects popped up. At the time, Chaos Monkey could only target instances on AWS and deployments managed with Spinnaker. As our service is primarily hosted on Google Cloud, this was not a good option for us. PowerfulSeal (Chaos Monkey’s Kubernetes equivalent) had just come out, but at xMatters we use a mix of both virtual machines (VMs) and Kubernetes deployments. We also considered another tool called Pumba that targets Docker containers. It has interesting features that go beyond simply shutting down or deleting instances, and its ability to use the ‘tc’ command to simulate poor network conditions allows for more refined failure experiments.

With our infrastructure, we would have needed a combination of tools to cover our system. This would have made it difficult to orchestrate failure experiments involving several components at once. We wanted the ability to reproduce and version-control scenarios so that once a vulnerability is found, our engineers can easily recreate it in different environments. Finally, for our early attempts at running failure experiments, we wanted the ability to selectively notify members of our organization so that they could monitor the progress of experiments.

The beating heart of Cthulhu
The following three aspects are at the core of Cthulhu:

  1. Cross-platform failure orchestration
  2. Version-controllable scenarios
  3. Notifications about the experiments being run, with a controllable level of verbosity

Cthulhu is also designed to be extensible so it’s easy to add cloud platforms and types of operations.

How you can use Cthulhu on your own services
Let’s look at a few examples of how you can use Cthulhu to run experiments on services. Microservice architecture can be quite complex, so it can take time to fully understand. For that reason, my examples will only focus on a subset of a fictitious online grocery service.

Let’s consider the following flows:

  1. A user adds products to a cart. The system has separate services to manage user profiles and shopping carts.
  2. When the user is ready, they send their cart for payment using one of three payment methods: credit card, e-transfer, and cash at the door.
  3. Once a payment service authorizes the transaction, it creates a new order in the order services, which holds information like the authorization number, shipping status, and tracking number.

Each payment method has a flow of its own, so they’re implemented as different services.

Payment process using a credit card

  1. The credit card service sends the user to a third-party card processing service to complete the payment.
  2. Once completed, the third-party service posts the confirmation number and amount paid back to the credit card service.
  3. Once the confirmation is received, the credit card service submits the order.
  4. If no confirmation is received after 15 minutes, the payment is canceled.

The e-transfer payment process

  1. The e-transfer service asks the user to send the money by email to a dedicated email address, with the invoice number as the subject.
  2. Upon receiving the email, the service accepts the transfer and submits the order.
  3. If no email is received after 25 hours, the payment is canceled.

Cash at the door payment process

  1. The cash-at-the-door service submits the order immediately.
  2. It notes that the order requires a payment upon delivery.

All services are running in three replicas managed by Kubernetes.

  1. The email server runs in a VM that is mirrored in three instances.
  2. There’s a gateway (running in three replicated VMs) between the credit card service and the third-party card processing service.

Let’s build a few hypotheses and scenarios for Cthulhu based on this environment:

name: Credit card service connection with 3rd party

# Hypothesis: When the Credit card service is not able to connect
# to the 3rd party card processing service, it flags itself as
# "unhealthy" and an alert is dispatched to the support team.

chaosevents:
  - description: Stopping all 3rd party card processing gateway.
    engine: gcp-compute
    operation: stop
    # target: gcp-region-regex/instance-name-regex
    target: .*/^gateway-3rd-party.*
    # stop 3 vms out of the target matches
    quantity: 3

Now let’s unpack that a bit:

  • A scenario file has a name and each chaos event has a description; Cthulhu uses those to log what it’s doing.
  • The engine tells Cthulhu what infrastructure to target.
  • The target is a pair of regular expressions, separated by a slash, that finds VM candidates for the event: the first matches the region and the second matches the instance name. In this case we’re looking for any VMs whose name starts with “gateway-3rd-party” in all available regions.
  • The operation describes what will be done to the targets.
  • Quantity limits the number of targets affected.
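To make the target and quantity fields more concrete, here is a minimal Python sketch of how a "region-regex/instance-name-regex" target could be resolved against a list of VMs. It is an illustration only, not Cthulhu's actual code: the `select_targets` function, the inventory, and the choice to pick matches at random are all assumptions.

```python
import random
import re


def select_targets(target, instances, quantity, rng=None):
    """Pick up to `quantity` (region, name) pairs matching a
    'region-regex/instance-name-regex' target string.

    Hypothetical helper: the selection logic is an assumption,
    not taken from Cthulhu itself.
    """
    rng = rng or random.Random()
    # Split on the first "/" only: the left side is the region
    # pattern, the right side is the instance-name pattern.
    region_pattern, name_pattern = target.split("/", 1)
    region_re = re.compile(region_pattern)
    name_re = re.compile(name_pattern)
    candidates = [
        (region, name)
        for region, name in instances
        if region_re.search(region) and name_re.search(name)
    ]
    # Quantity caps how many of the matching candidates are affected.
    return rng.sample(candidates, min(quantity, len(candidates)))


# Made-up inventory for the fictitious grocery service.
inventory = [
    ("us-central1", "gateway-3rd-party-1"),
    ("us-east1", "gateway-3rd-party-2"),
    ("europe-west1", "gateway-3rd-party-3"),
    ("us-central1", "mail-1"),
    ("us-central1", "internal-gateway-3rd-party"),
]

chosen = select_targets(".*/^gateway-3rd-party.*", inventory, quantity=3)
for region, name in sorted(chosen):
    print(f"stop {name} in {region}")
```

Because the name pattern is anchored with `^`, the VM named "internal-gateway-3rd-party" is not a candidate, while the three gateway VMs are matched regardless of region.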

The next scenario is similar, but this time we target VMs whose names start with “mail” in all regions whose names start with “us-central”:

name: E-transfer service connection with mail server
# Hypothesis: When the E-transfer service cannot connect to the mail
# server, it flags itself as "unhealthy" and an alert is dispatched
# to the support team.

chaosevents:
  - description: Stopping all mail servers.
    engine: gcp-compute
    operation: stop
    # target: gcp-region-regex/instance-name-regex
    target: ^us-central.*/^mail.*
    # stop 3 vms out of the target matches
    quantity: 3

In this final example, we define a chaos event for each type of payment service:

name: Cart service detection of unavailable payment services
# Hypothesis: Payment services that are not healthy show as
# "temporarily unavailable" in the payment choices, when
# checking out.

chaosevents:
  - description: Delete pods for the Credit card service.
    engine: kubernetes
    operation: delete
    # target: kube-namespace-regex/pod-name-regex
    target: credit-card-payment/^credit-card-payment-.*
    # delete 3 pods out of the target matches
    quantity: 3
    # Assuming pods are recreated by their deployment, we
    # keep deleting them every minute for an hour.
    schedule:
      delay: 1m
      initialdelay: 0
      repeat: 60

  - description: Delete pods for the E-transfer service.
    engine: kubernetes
    operation: delete
    target: e-transfer-payment/^e-transfer-payment-.*
    quantity: 3
    schedule:
      delay: 1m
      initialdelay: 0
      repeat: 60

  - description: Delete pods for the Cash at the door service.
    engine: kubernetes
    operation: delete
    target: cash-at-door-payment/^cash-at-door-payment-.*
    quantity: 3
    schedule:
      delay: 1m
      initialdelay: 0
      repeat: 60

Those services are running in a Kubernetes cluster, so we set the engine accordingly.

Assuming the service pods are managed by a deployment, we use the delete operation on our targets, and Kubernetes will create replacements as needed. The target for Kubernetes matches namespaces and pod names instead of regions and VM names.

Because the deployment in Kubernetes replaces deleted pods, we added a schedule with repetition so that those services will be kept offline for an hour.
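As a rough illustration of what such a schedule amounts to, the Python sketch below expands the delay, initialdelay, and repeat values into the times at which an event would fire. The semantics here are assumptions made for the example (repeat is treated as the total number of runs, and a bare number is read as seconds); check the Cthulhu readme for the exact behavior. The `parse_duration` and `run_offsets` helpers are hypothetical.

```python
import re
from datetime import timedelta

# Assumed unit suffixes; a bare number is treated as seconds.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(value):
    """Convert values such as '1m', '30s', or 0 into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh]?)", str(value).strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount = int(match.group(1))
    unit = match.group(2) or "s"
    return timedelta(**{_UNITS[unit]: amount})


def run_offsets(schedule):
    """Return the offsets from scenario start at which the event fires.

    Assumption: the event runs `repeat` times in total, starting after
    `initialdelay` and then once every `delay`.
    """
    initial = parse_duration(schedule.get("initialdelay", 0))
    delay = parse_duration(schedule.get("delay", 0))
    repeat = int(schedule.get("repeat", 1))
    return [initial + i * delay for i in range(repeat)]


offsets = run_offsets({"delay": "1m", "initialdelay": 0, "repeat": 60})
print(f"{len(offsets)} runs, first at {offsets[0]}, last at {offsets[-1]}")
```

Under these assumptions, the schedule above yields 60 deletions, one per minute, which is what keeps the replacement pods from staying up for the duration of the experiment.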

Scenario files are designed to be easy to read and write. Cthulhu’s readme file provides more details about the operations and parameters that are available.

I hope you enjoy Cthulhu, and that it helps you build resilient software!

Have some experience with chaos engineering? Let us know on our social channels.
