Arun Shah

A Practical Introduction to Chaos Engineering

Introduction: Beyond Testing

In the world of distributed systems, failure is inevitable. Hardware fails, networks become unreliable, and software has bugs. While traditional testing can validate known conditions, it often fails to uncover the complex, unpredictable failures that occur in production. This is where Chaos Engineering comes in.

Chaos Engineering is not about breaking things randomly; it is a disciplined, scientific approach to revealing weaknesses in your system. By proactively injecting controlled failures, you can identify and fix issues before they cause a full-blown outage, thereby building confidence in your system’s resilience.

This post provides a practical introduction to the principles of Chaos Engineering, explores popular open-source tools, and walks through a hands-on example of running a chaos experiment in Kubernetes.

The Principles of Chaos Engineering

At its core, Chaos Engineering is about conducting experiments to measure the resilience of a system. These experiments follow a four-step process:

  1. Define a Steady State: First, establish a measurable metric that represents the normal behavior of your system. This could be a specific business metric (e.g., orders per minute), a system metric (e.g., CPU utilization), or an application metric (e.g., API latency). A consistent steady state is critical for knowing whether your experiment has had an impact.

  2. Formulate a Hypothesis: State a clear hypothesis about what you expect to happen when you introduce a specific failure. For example: “If we inject 100ms of network latency into the payment service, the overall checkout success rate will not drop below 99%.” (A minimal sketch of such a check appears after this list.)

  3. Inject Real-World Failures: Introduce failures that mimic real-world events. This could include:

    • Resource Exhaustion: High CPU, memory, or disk usage.
    • Network Unreliability: Latency, packet loss, or blackouts.
    • Instance Failures: Terminating pods, VMs, or containers.
    • Dependency Failures: Making a critical downstream service unavailable.
  4. Verify and Learn: Compare the steady state of your system during the experiment to your hypothesis. If there is a deviation, you have found a weakness. The goal is not for the experiment to “pass,” but to learn from the outcome and improve the system’s resilience.
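
To make steps 1 and 2 concrete, here is a minimal sketch of a steady-state check: it polls a health endpoint while an experiment runs and reports a success rate to compare against the hypothesis. The endpoint URL and the 99% threshold are placeholders; in practice you would usually read this metric from your monitoring system (for example, via a Prometheus query) rather than probing it by hand.

#!/bin/sh
# steady-state-check.sh: minimal sketch; the URL below is a placeholder
TOTAL=0
OK=0
for i in $(seq 1 100); do
  TOTAL=$((TOTAL + 1))
  # Record whether the (hypothetical) checkout health endpoint answered 200 OK
  CODE=$(curl -s -o /dev/null -w '%{http_code}' http://checkout.example.internal/health)
  [ "$CODE" = "200" ] && OK=$((OK + 1))
  sleep 1
done
# Compare against the hypothesis, e.g. "stays at or above 99%"
echo "Success rate: $((OK * 100 / TOTAL))%"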

Crucially, always minimize the blast radius. Start experiments in a development or staging environment, and when you move to production, begin with a small, controlled scope (e.g., a single host or a small percentage of traffic).
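
As an example of keeping the blast radius to a single host, the network unreliability scenario from the list above can be injected on one Linux machine with tc from the iproute2 package. This is only a sketch: it assumes root access, that eth0 is the relevant interface, and that it is acceptable to delay all traffic on that host for the duration of the experiment.

# Add 100ms of latency to all outbound traffic on eth0 (requires root)
tc qdisc add dev eth0 root netem delay 100ms

# ... run the experiment and watch your steady-state metric ...

# Remove the injected latency once the experiment is finished
tc qdisc del dev eth0 root netem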

Tooling: The Chaos Engineering Arsenal

Several open-source tools have emerged to help you implement Chaos Engineering. For this post, we will focus on LitmusChaos due to its Kubernetes-native approach and growing popularity.

Hands-On: Your First Chaos Experiment with LitmusChaos

Let’s walk through a simple experiment: terminating a pod in a Kubernetes application and observing whether the system recovers.

1. Install LitmusChaos

You can install LitmusChaos in your Kubernetes cluster using Helm:

# Add the LitmusChaos Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/

# Create a namespace for Litmus
kubectl create ns litmus

# Install the LitmusChaos control plane
helm install chaos litmuschaos/litmus --namespace=litmus
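
Once the Helm release is installed, it is worth confirming that the control-plane pods come up before moving on (the exact pod names vary by chart version):

# Confirm the LitmusChaos control-plane components are running
kubectl get pods -n litmus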

2. Deploy a Sample Application

We need an application to experiment on. Let’s deploy a simple Nginx deployment:

# nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80

Apply it with kubectl apply -f nginx-deployment.yaml.
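
Before injecting any chaos, confirm the steady state of this toy application, namely three Ready Nginx replicas:

# Wait for the rollout to finish and confirm 3/3 replicas are available
kubectl rollout status deployment/nginx-deployment
kubectl get deployment nginx-deployment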

3. Define the Chaos Experiment

LitmusChaos uses Custom Resource Definitions (CRDs) to define experiments. We will use the pod-delete experiment.
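
Note that the pod-delete ChaosExperiment CR itself must be available in the target namespace; it is typically installed from the LitmusChaos ChaosHub, and the exact procedure depends on the Litmus version you are running. You can check whether it is already present with:

# List the ChaosExperiment CRs installed in the default namespace
kubectl get chaosexperiments -n default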

First, we need to set up the necessary RBAC for the experiment:

# chaos-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-role
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create","delete","get","list","patch","update","deletecollection"]
# ... other permissions
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-rb
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-role
subjects:
- kind: ServiceAccount
  name: pod-delete-sa
  namespace: default
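
Once this manifest is applied (step 4 below applies it together with the ChaosEngine), you can sanity-check the permissions with kubectl's built-in authorization query:

# Check that the service account is allowed to delete pods in the default namespace
kubectl auth can-i delete pods -n default \
  --as=system:serviceaccount:default:pod-delete-sa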

Now, let’s define the ChaosEngine, which links the application, the experiment, and the RBAC:

# chaos-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
          # Total duration of the chaos injection, in seconds
          - name: TOTAL_CHAOS_DURATION
            value: '30'
          # Wait between successive pod deletions, in seconds
          - name: CHAOS_INTERVAL
            value: '10'
          # Delete pods gracefully rather than force-killing them
          - name: FORCE
            value: 'false'
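
Optionally, you can encode the steady-state hypothesis directly into the engine with a Litmus probe, so that the verdict reflects application health during the chaos rather than just experiment completion. The sketch below follows the httpProbe layout from the Litmus 2.x documentation and assumes a ClusterIP Service named nginx in front of the deployment (which this walkthrough has not created); the probe schema has changed between releases, so check the docs for the version you installed.

# Excerpt: add alongside 'components' under the pod-delete experiment's spec
      probe:
      - name: check-nginx-availability
        type: httpProbe
        mode: Continuous
        httpProbe/inputs:
          # Assumes a Service named 'nginx' exposing the deployment on port 80
          url: http://nginx.default.svc.cluster.local:80
          method:
            get:
              criteria: "=="
              responseCode: "200"
        runProperties:
          probeTimeout: 5
          interval: 2
          retry: 1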

4. Run the Experiment and Observe

Apply the RBAC and the ChaosEngine:

kubectl apply -f chaos-rbac.yaml
kubectl apply -f chaos-engine.yaml

Now, watch the pods in your deployment:

kubectl get pods -l app=nginx --watch

You will see one of the Nginx pods being terminated, after which the Kubernetes Deployment controller automatically brings up a replacement to maintain the desired count of three replicas. This demonstrates the self-healing nature of a Kubernetes Deployment.

To check the results of the experiment, you can inspect the ChaosResult resource:

kubectl get chaosresult nginx-chaos-pod-delete -o yaml

You will see that the experiment has a verdict of Pass, indicating that the system behaved as expected.
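
If you only want the verdict itself, a JSONPath query keeps the output short (the field path below matches the ChaosResult schema at the time of writing; adjust it if your version differs):

# Print only the experiment verdict (e.g., Pass, Fail, Awaited)
kubectl get chaosresult nginx-chaos-pod-delete -n default \
  -o jsonpath='{.status.experimentStatus.verdict}'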

Conclusion

Chaos Engineering is a powerful discipline that moves beyond traditional testing to build truly resilient systems. By embracing controlled, proactive failure injection, you can uncover hidden weaknesses, validate your assumptions, and gain the confidence that your system can withstand the inevitable turbulence of production. Tools like LitmusChaos make it easier than ever to get started with Chaos Engineering in a Kubernetes-native way. Start small, be scientific, and build a culture of resilience in your organization.

Further Reading

To learn more about Chaos Engineering, the Principles of Chaos Engineering manifesto and the official LitmusChaos documentation are good places to start.
