Quick answer: At 2am, your on-call engineer doesn't remember which command restarts the matchmaker. A runbook with copy-pasteable commands and decision trees turns minutes into seconds.
The on-call rotation isn't the engineer who built the system. The runbook is what closes the gap.
Lead with the decision tree
'Symptom: matchmaker queue not draining.' Three branches: (a) check fly logs, (b) restart, (c) page the on-call lead. Each branch has the exact command.
Copy-paste commands only
No 'replace with appropriate value' instructions. Write the actual command with the actual env var. The engineer at 2am doesn't have ambient context.
Test the runbook quarterly
Stage a fake incident. The on-call follows only the runbook. Wherever they hesitate, the runbook needs an update.
Link to the dashboard
Every symptom heading links to the relevant Grafana panel. The graph confirms or denies the symptom faster than a fresh query.
“A runbook is a love letter from sober-you to sleep-deprived-you.”
Keep runbooks in the same repo as the code. Pull requests to the runbook are reviewed alongside related changes; staleness becomes harder to ignore.