Introduction: Beyond Testing
In the world of distributed systems, failure is inevitable. Hardware fails, networks become unreliable, and software has bugs. While traditional testing can validate known conditions, it often fails to uncover the complex, unpredictable failures that occur in production. This is where Chaos Engineering comes in.
Chaos Engineering is not about breaking things randomly; it is a disciplined, scientific approach to revealing weaknesses in your system. By proactively injecting controlled failures, you can identify and fix issues before they cause a full-blown outage, thereby building confidence in your system’s resilience.
This post provides a practical introduction to the principles of Chaos Engineering, explores popular open-source tools, and walks through a hands-on example of running a chaos experiment in Kubernetes.
The Principles of Chaos Engineering
At its core, Chaos Engineering is about conducting experiments to measure the resilience of a system. These experiments follow a four-step process:
1. Define a Steady State: First, establish a measurable, quantifiable metric that represents the normal behavior of your system. This could be a specific business metric (e.g., orders per minute), a system metric (e.g., CPU utilization), or an application metric (e.g., API latency). A consistent steady state is critical for knowing whether your experiment has had an impact.
2. Formulate a Hypothesis: State a clear hypothesis about what you expect to happen when you introduce a specific failure. For example: “If we inject 100ms of network latency into the payment service, the overall checkout success rate will not drop below 99%.”
3. Inject Real-World Failures: Introduce failures that mimic real-world events. This could include:
   - Resource Exhaustion: High CPU, memory, or disk usage.
   - Network Unreliability: Latency, packet loss, or blackouts.
   - Instance Failures: Terminating pods, VMs, or containers.
   - Dependency Failures: Making a critical downstream service unavailable.
4. Verify and Learn: Compare the metric you observe during the experiment against both the steady state and your hypothesis. If the metric deviates from what you predicted, you have found a weakness. The goal is not for the experiment to “pass,” but to learn from the outcome and improve the system’s resilience.
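To make the verification step concrete, here is a minimal sketch of checking the example hypothesis (checkout success rate stays at or above 99%). The `Window` class, the `hypothesis_holds` function, and the request counts are all hypothetical and purely illustrative; in practice the numbers would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one measurement window."""
    successes: int
    total: int

    @property
    def success_rate(self) -> float:
        # Treat an empty window as 0.0 so missing data never looks healthy.
        return self.successes / self.total if self.total else 0.0


def hypothesis_holds(during_chaos: Window, min_success_rate: float = 0.99) -> bool:
    """True if the success rate during the experiment stayed at or above the threshold."""
    return during_chaos.success_rate >= min_success_rate


# Illustrative numbers only, not measurements from a real system.
baseline = Window(successes=9990, total=10000)    # steady state before injection
experiment = Window(successes=9875, total=10000)  # while the failure is injected

print(f"baseline:   {baseline.success_rate:.2%}")
print(f"experiment: {experiment.success_rate:.2%}")
print("hypothesis holds" if hypothesis_holds(experiment) else "weakness found")
```

With these made-up counts the experiment window falls to 98.75%, so the sketch reports a weakness, which is exactly the kind of outcome the experiment is meant to surface.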
Crucially, always minimize the blast radius. Start experiments in a development or staging environment, and when you move to production, begin with a small, controlled scope (e.g., a single host or a small percentage of traffic).
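As a rough illustration of limiting the blast radius, the sketch below picks a small random subset of hosts to target. The `select_targets` helper and the host names are hypothetical; real chaos tools usually expose their own scoping options, which you should prefer when available.

```python
import math
import random
from typing import List, Optional


def select_targets(hosts: List[str], fraction: float, seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of hosts, capped at `fraction` of the fleet (minimum one)."""
    if not hosts:
        return []
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round down so the experiment never touches more hosts than requested.
    count = max(1, math.floor(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)


# Hypothetical fleet of twenty hosts; 5% of it is a single host.
fleet = [f"host-{i}" for i in range(20)]
print(select_targets(fleet, fraction=0.05, seed=42))
```

Passing a seed makes the selection reproducible, which helps when you want to rerun the same experiment against the same targets.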
Tooling: The Chaos Engineering Arsenal
Several tools, most of them open source, have emerged to help you implement Chaos Engineering.
- Chaos Monkey: The original chaos tool from Netflix, it randomly terminates virtual machine instances and containers.
- Gremlin: A commercial “Failure-as-a-Service” platform with a free tier, offering a wide range of failure modes.
- LitmusChaos: A powerful, open-source, and CNCF-hosted chaos engineering framework specifically for Kubernetes. It provides a rich set of experiments and is designed to be declarative and easy to use.
For this post, we will focus on LitmusChaos due to its Kubernetes-native approach and growing popularity.
Hands-On: Your First Chaos Experiment with LitmusChaos
Let’s walk through a simple experiment: terminating a pod in a Kubernetes application and observing whether the system recovers.
1. Install LitmusChaos
You can install LitmusChaos in your Kubernetes cluster using Helm:
# Add the LitmusChaos Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
# Create a namespace for Litmus
kubectl create ns litmus
# Install the LitmusChaos control plane
helm install chaos litmuschaos/litmus --namespace=litmus
2. Deploy a Sample Application
We need an application to experiment on. Let’s deploy a simple Nginx deployment:
# nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
Apply it with kubectl apply -f nginx-deployment.yaml.
3. Define the Chaos Experiment
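LitmusChaos uses Custom Resource Definitions (CRDs) to define experiments. We will use the pod-delete experiment. The steps below assume the pod-delete ChaosExperiment resource is already installed in the default namespace (for example, from ChaosHub), since the ChaosEngine refers to it by name.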
First, we need to set up the necessary RBAC for the experiment:
# chaos-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-role
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create","delete","get","list","patch","update","deletecollection"]
  # ... other permissions
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-rb
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-role
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    # ServiceAccount subjects must specify their namespace
    namespace: default
Now, let’s define the ChaosEngine, which links the application, the experiment, and the RBAC:
# chaos-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total time (in seconds) over which chaos is injected
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Time (in seconds) between successive pod deletions
            - name: CHAOS_INTERVAL
              value: '10'
            # 'false' requests a graceful deletion rather than a forced one
            - name: FORCE
              value: 'false'
4. Run the Experiment and Observe
Apply the RBAC and the ChaosEngine:
kubectl apply -f chaos-rbac.yaml
kubectl apply -f chaos-engine.yaml
Now, watch the pods in your deployment:
kubectl get pods -l app=nginx --watch
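You will see an Nginx pod being terminated roughly every 10 seconds during the 30-second chaos window (per the CHAOS_INTERVAL and TOTAL_CHAOS_DURATION values in the ChaosEngine), and each time the Deployment’s ReplicaSet controller will automatically bring up a replacement to maintain the desired state of 3 replicas. This demonstrates the self-healing nature of a Kubernetes Deployment.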
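If you would rather check recovery programmatically than watch the pod list, one option is to compare the Deployment's desired and ready replica counts. The sketch below is an assumption-laden example rather than part of LitmusChaos: `is_recovered` and `fetch_deployment` are hypothetical helper names, and `fetch_deployment` simply shells out to kubectl, so it needs access to your cluster.

```python
import json
import subprocess


def is_recovered(deployment: dict) -> bool:
    """True when the number of ready replicas matches the desired replica count."""
    desired = deployment.get("spec", {}).get("replicas", 0)
    # readyReplicas is omitted from the status when no replica is ready.
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready == desired


def fetch_deployment(name: str, namespace: str = "default") -> dict:
    """Read the Deployment object through kubectl (requires cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "deployment", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    # Offline example using a hand-written snapshot, not real cluster output.
    snapshot = {"spec": {"replicas": 3}, "status": {"readyReplicas": 2}}
    print(is_recovered(snapshot))  # False: one replica is still starting
```

Against a live cluster you could call is_recovered(fetch_deployment("nginx-deployment")) in a loop with a timeout and treat a timeout as a failed hypothesis.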
To check the results of the experiment, you can inspect the ChaosResult resource:
kubectl get chaosresult nginx-chaos-pod-delete -o yaml
Once the experiment has completed, you should see a verdict of Pass, indicating that the system behaved as expected.
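For scripting or CI use, the verdict can be pulled out of the JSON form of the same resource. The sketch below assumes the verdict lives at status.experimentStatus.verdict, which matches the ChaosResult layout as I understand it; check the field path against the Litmus version you have installed. The `extract_verdict` helper and the sample document are hypothetical.

```python
import json


def extract_verdict(chaos_result: dict) -> str:
    """Return the experiment verdict, or 'Unknown' if the field is absent."""
    return (
        chaos_result.get("status", {})
        .get("experimentStatus", {})
        .get("verdict", "Unknown")
    )


# Hand-written sample shaped like `kubectl get chaosresult ... -o json` output;
# it is not captured from a real cluster.
sample = json.loads("""
{
  "kind": "ChaosResult",
  "metadata": {"name": "nginx-chaos-pod-delete"},
  "status": {"experimentStatus": {"phase": "Completed", "verdict": "Pass"}}
}
""")

print(extract_verdict(sample))
```

Returning "Unknown" for a missing field keeps a half-finished or malformed result from being mistaken for a pass.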
Conclusion
Chaos Engineering is a powerful discipline that moves beyond traditional testing to build truly resilient systems. By embracing controlled, proactive failure injection, you can uncover hidden weaknesses, validate your assumptions, and gain the confidence that your system can withstand the inevitable turbulence of production. Tools like LitmusChaos make it easier than ever to get started with Chaos Engineering in a Kubernetes-native way. Start small, be scientific, and build a culture of resilience in your organization.
Further Reading
To learn more about Chaos Engineering, I highly recommend the following resources:
- Principles of Chaos Engineering: https://principlesofchaos.org/
- LitmusChaos Documentation: https://docs.litmuschaos.io/
- Netflix Technology Blog: https://netflixtechblog.com/