Quick answer: Measure a steady state (match completion rate, reconnect rate, p99 tick time), inject one fault at a time (packet loss, process kill, CPU pressure, network partition) against a staging cluster, and record which parts of the stack break first. Promote proven experiments into a nightly pipeline and only then carefully graduate them to production.
Every multiplayer game ships with failure modes nobody anticipated: the server that crashes when a single player’s connection hiccups at the wrong moment, the matchmaker that deadlocks when the database is slow, the reconnect flow that doubles the player count because the ghost session never cleared. Chaos engineering finds these bugs on purpose, by injecting the same conditions that randomly hit you in production so you can fix them calmly rather than at 2 AM.
Start With a Steady State
You cannot tell whether an experiment made things worse unless you know what “good” looks like. Define a steady state with three to five metrics that capture healthy behavior under load:
- Match completion rate (expect > 98 percent)
- Reconnect success rate (expect > 95 percent)
- p99 server tick time (expect < 16 ms for 60 Hz servers)
- Session error rate (expect < 0.5 percent)
- Matchmaking time-to-match (expect < 60 seconds)
Run a baseline load test in staging with synthetic players and record these numbers. Every chaos experiment compares back to this baseline. If an experiment does not measurably move any of them, either the fault was too small or your metrics are not sensitive enough.
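The baseline comparison can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed implementation: the metric names, the THRESHOLDS table, and the 5 percent "moved" tolerance are assumptions chosen to mirror the list above, and rates are expressed as fractions rather than percentages.

```python
import operator

# Steady-state thresholds from the list above. Metric names are illustrative.
THRESHOLDS = {
    "match_completion_rate": (operator.gt, 0.98),
    "reconnect_success_rate": (operator.gt, 0.95),
    "p99_tick_ms": (operator.lt, 16.0),
    "session_error_rate": (operator.lt, 0.005),
    "time_to_match_s": (operator.lt, 60.0),
}


def violations(metrics):
    """Return the names of metrics that break their steady-state threshold."""
    return sorted(
        name
        for name, (ok, limit) in THRESHOLDS.items()
        if not ok(metrics[name], limit)
    )


def moved(baseline, observed, tolerance=0.05):
    """Return metrics that shifted by more than `tolerance` relative to baseline."""
    return sorted(
        name
        for name in THRESHOLDS
        if abs(observed[name] - baseline[name]) > tolerance * abs(baseline[name])
    )
```

An empty result from moved() is the "fault too small or metrics not sensitive enough" case; a non-empty result from violations() is a failed experiment.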
Start With Packet Loss and Latency
The simplest chaos experiment is degrading the network. Linux tc with the netem qdisc is the standard tool. It lets you inject packet loss, latency, jitter, and reordering on any network interface.
# Inject 5% packet loss + 100ms latency + 20ms jitter
tc qdisc add dev eth0 root netem loss 5% delay 100ms 20ms
# Remove when done
tc qdisc del dev eth0 root
Run this on a game server while a match is live, then watch the metrics. What you are looking for is graceful degradation: higher reconnect rate, no new crashes, match completion rate holds. What you are hoping not to see is a 4x spike in tick time (your server is doing expensive work per dropped packet) or a sudden drop in completed matches (clients giving up and kicking players out).
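A fault that outlives its experiment corrupts every later measurement, so it helps to make cleanup automatic. The sketch below wraps the same two tc commands in a Python context manager; the function name netem and the injectable run parameter are illustrative assumptions, and the real commands still need root privileges.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def netem(dev, loss_pct, delay_ms, jitter_ms, run=subprocess.run):
    """Apply a netem qdisc for the duration of the `with` block, then remove it."""
    add = ["tc", "qdisc", "add", "dev", dev, "root", "netem",
           "loss", f"{loss_pct}%", "delay", f"{delay_ms}ms", f"{jitter_ms}ms"]
    delete = ["tc", "qdisc", "del", "dev", dev, "root"]
    run(add, check=True)
    try:
        yield
    finally:
        # Runs even if the experiment body raises, so the fault is always removed.
        run(delete, check=True)
```

Usage would look like `with netem("eth0", 5, 100, 20): observe_metrics()`, where observe_metrics is whatever measurement step your harness provides.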
Kill Processes Randomly
A pod or a process can die for any reason: OOM, a SIGKILL from the scheduler, a hardware fault. Your game should survive the death of a single server with no more than a 30-second match disruption. The only way to know is to kill a server and watch.
On Kubernetes, Chaos Mesh lets you schedule pod kills declaratively:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: random-match-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      "app": match-server
  # Chaos Mesh 1.x syntax; 2.x uses a separate Schedule resource instead of this field
  scheduler:
    cron: "@every 10m"
Without Kubernetes, a cron job that SIGKILLs a random server process every ten minutes is enough. Script the experiment, record which matches were affected, and check that the game client reconnects cleanly, the matchmaker opens a replacement, and no player’s inventory got duplicated during the transition.
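The non-Kubernetes variant might look like the following sketch, meant to be invoked from cron. How you discover the server PIDs is deployment-specific and left to the caller; the function name, the log record shape, and the injectable rng, kill, and now parameters are assumptions made so the logic can be exercised without killing anything.

```python
import os
import random
import signal
import time


def kill_random_server(pids, log, rng=random, kill=os.kill, now=time.time):
    """SIGKILL one process chosen at random and append a record to `log`."""
    if not pids:
        return None
    victim = rng.choice(list(pids))
    kill(victim, signal.SIGKILL)
    # The timestamp lets you correlate the kill with affected matches afterwards.
    log.append({"pid": victim, "killed_at": now()})
    return victim
```

Keeping the kill log alongside match telemetry is what makes it possible to answer "which matches were affected" after the run.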
CPU and Memory Pressure
Slow servers produce different bugs than dead servers. A server stuck at 100 percent CPU for 30 seconds may still hold TCP connections open but miss tick deadlines, causing rubber-banding on every client. stress-ng is the right tool for CPU and memory pressure:
# Saturate half the cores for 60 seconds
stress-ng --cpu $(( $(nproc) / 2 )) --timeout 60s
# Burn 2 GB of memory for 5 minutes
stress-ng --vm 2 --vm-bytes 1G --timeout 300s
The question chaos asks is: does your tick scheduler skip work gracefully, or does it try to catch up and make things worse? Many game servers queue physics steps when frames are slow, then run them all at once when CPU frees up, producing a visible hitch on every client. Finding this in a chaos test is cheap; finding it in production is very expensive.
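One common guard against that catch-up burst is to cap the number of simulation steps per frame and discard the rest. The sketch below shows the bookkeeping only; the names plan_ticks and TICK_US, the cap of three steps, and the choice to drop ticks rather than slow simulated time are all assumptions, not a description of any particular engine.

```python
TICK_US = 16_667  # one 60 Hz tick, in microseconds (integers avoid float drift)


def plan_ticks(backlog_us, elapsed_us, max_catch_up=3):
    """Decide how many simulation steps to run this frame.

    Returns (steps, dropped, remainder_us). At most `max_catch_up` steps run;
    whole ticks beyond that are dropped instead of replayed in one burst.
    """
    owed, remainder_us = divmod(backlog_us + elapsed_us, TICK_US)
    steps = min(owed, max_catch_up)
    dropped = owed - steps
    return steps, dropped, remainder_us
```

After a 500 ms stall this runs three steps and drops the other 26 owed ticks, so the server resumes at its normal cadence instead of hitching. A nonzero dropped count is also a useful metric to export during CPU-pressure experiments.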
Network Partitions
A partition splits your infrastructure into two halves that can each still reach players but not each other. This is the classic split-brain scenario and one of the most dangerous bug classes in distributed systems. If your matchmaker and your game servers run in separate processes and the network between them drops for 20 seconds, what happens?
Simulate with iptables:
# Run on a match server: block its traffic to the matchmaker (10.0.0.100 in this example)
iptables -A OUTPUT -d 10.0.0.100 -j DROP
# Restore
iptables -D OUTPUT -d 10.0.0.100 -j DROP
Watch for cascading failures. A classic pattern: the match server cannot report match results to the matchmaker, retries for 30 seconds, then enters a reconnect storm when the partition heals, overwhelming the matchmaker and knocking it over at exactly the wrong moment. Your retry logic should back off exponentially with jitter, not hammer the moment connectivity returns.
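Exponential backoff with jitter can be sketched as below. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the base of 0.5 seconds, the 30-second cap, and the function name are illustrative assumptions.

```python
import random


def backoff_delays(base_s=0.5, cap_s=30.0, attempts=6, rng=random):
    """Yield one randomized delay per retry attempt ("full jitter")."""
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Randomizing the delay keeps servers from retrying in lockstep.
        yield rng.uniform(0, ceiling)
```

A match server would sleep for each yielded delay between attempts to report results. Because every server draws a different delay, the retries spread out after the partition heals instead of arriving as one synchronized wave.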
Automate the Pipeline
One-off experiments find bugs once. A chaos pipeline finds them every night as your code changes. Schedule the full battery nightly on a staging cluster under a realistic synthetic load. Any run that violates the steady-state thresholds opens a ticket automatically.
Start small: ten experiments, one hour of runtime, simple pass/fail criteria. Grow the battery as you fix what the first round found. Graduate to production only after months of clean staging runs, and only on a small percentage of traffic with automatic abort.
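The nightly loop itself can stay very small. In this sketch each experiment is a callable that injects its fault, restores the system, and returns the metrics it observed; is_healthy and open_ticket are hypothetical hooks standing in for your threshold check and your issue tracker.

```python
def run_battery(experiments, is_healthy, open_ticket):
    """Run each (name, run) pair; file a ticket for every failed or crashed run."""
    failed = []
    for name, run in experiments:
        try:
            detail = run()
            ok = is_healthy(detail)
        except Exception as exc:
            # An experiment that crashes is treated as a failure, not skipped.
            ok, detail = False, repr(exc)
        if not ok:
            open_ticket(name, detail)
            failed.append(name)
    return failed
```

Returning the list of failed experiment names gives the pipeline a simple pass/fail signal: an empty list is a clean night.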
“Our first chaos run found a deadlock in match cleanup that only triggered when the database had more than 200 ms of latency. That bug had been in shipping code for eight months and caused weekend-long outages whenever our cloud provider had a bad afternoon. One night of chaos testing fixed what eight months of production hadn’t.”
Related Issues
For related server-stability topics, see bug reporting for multiplayer games. For the release-level equivalent, read how to track and reduce crash rate over releases.
Run your first chaos experiment tomorrow. Pick one server, inject 5 percent packet loss for five minutes, and watch what breaks. You will learn more about your stack in those five minutes than in the last month of uptime.