Quick answer: A live ops runbook documents step-by-step procedures for incident response, hotfix deployment, server restarts, rollbacks, and player communication. Write it before launch. Include severity definitions, escalation paths, exact commands, decision criteria, and pre-written communication templates. The runbook exists so your team can follow a rehearsed process under pressure instead of making it up in real time.

The worst time to figure out your incident response process is during an incident. At 2 AM on launch night, with thousands of players unable to connect and your Discord blowing up, you do not want to be debating who should restart the servers, whether you need approval to deploy a hotfix, or what to tell the community. You want a document that says: step one, do this; step two, do that; step three, post this message. A runbook is that document, and it is the single most important thing you can write before your game goes live.

Incident Response: The First Five Minutes

When an alert fires or a player reports a critical issue, the first five minutes determine whether the incident is resolved quickly or spirals into chaos. Your runbook should script those five minutes precisely. Who gets notified first? What channel do they use to communicate? How do they assess severity? What are their options?

Define three or four severity levels. Severity 1 (S1) means the game is unplayable for all or most players: servers down, login broken, data loss. Severity 2 (S2) means a major feature is broken but the game is partially playable: matchmaking failing, in-app purchases not processing, a specific game mode crashing. Severity 3 (S3) means a noticeable bug that does not prevent play: a visual glitch, a leaderboard displaying stale data, a non-critical UI element missing. Severity 4 (S4) is cosmetic or minor, to be handled during normal working hours.

For each severity level, define the response expectations. S1 incidents require immediate response regardless of time of day, all hands on deck, player communication within 15 minutes, and resolution or mitigation within one hour. S2 incidents require response within 30 minutes during business hours or one hour off-hours, with a status update to players within one hour. S3 and S4 are handled during the next business day. These timeframes are not arbitrary — they set expectations for both the team and the players.
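The severity table above works best as data your incident tooling can read, so triage and paging decisions come out the same no matter who is on call. A minimal sketch in Python; the `SeverityPolicy` type and its field names are hypothetical, and the timings mirror the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str                              # short description of impact
    response_minutes: Optional[int]         # max time to first response (None = next business day)
    player_update_minutes: Optional[int]    # max time to first player-facing update
    page_off_hours: bool                    # does this severity wake people up?


# Values lifted from the severity definitions above.
POLICIES = {
    "S1": SeverityPolicy("game unplayable for most players", 0, 15, True),
    "S2": SeverityPolicy("major feature broken, game partially playable", 30, 60, True),
    "S3": SeverityPolicy("noticeable bug, play not prevented", None, None, False),
    "S4": SeverityPolicy("cosmetic or minor", None, None, False),
}


def should_page(severity: str) -> bool:
    """Decide whether this severity wakes the on-call person off-hours."""
    return POLICIES[severity].page_off_hours
```

Keeping the table in one place means the runbook prose and the alerting config cannot drift apart silently.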

On-Call Rotation and Escalation

Someone must be responsible for responding to incidents at any given time. For indie studios, the on-call rotation can be as simple as “the person who deployed most recently is on-call until the next deploy.” For larger teams, rotate weekly and document the current on-call person in a shared channel topic or a status page.

The on-call person’s job is triage, not repair. They assess severity, notify the right people, and start the documented response procedure. They do not need to be able to fix every possible problem — they need to know who can. The runbook should include a contact list with each team member’s area of expertise: who owns the backend, who owns matchmaking, who owns the database, who owns the client build. Include phone numbers, not just Slack handles, because Slack is the first thing that goes down during a major incident.
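The contact list is more useful as structured data than as prose, so a script or bot can answer “who owns matchmaking?” instantly at 2 AM. A sketch with placeholder names, areas, and numbers; your real directory replaces all of them:

```python
# Hypothetical contact directory mirroring the runbook's escalation list.
CONTACTS = [
    {"name": "Alice", "area": "backend",     "phone": "+1-555-0101", "slack": "@alice"},
    {"name": "Bob",   "area": "matchmaking", "phone": "+1-555-0102", "slack": "@bob"},
    {"name": "Carol", "area": "database",    "phone": "+1-555-0103", "slack": "@carol"},
    {"name": "Dave",  "area": "client",      "phone": "+1-555-0104", "slack": "@dave"},
]


def who_to_call(area: str) -> dict:
    """Return the documented owner for a given area of the system."""
    for contact in CONTACTS:
        if contact["area"] == area:
            return contact
    # No documented owner is itself an escalation signal.
    raise KeyError(f"no owner documented for {area!r} -- escalate to everyone")
```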

Hotfix Workflow

A hotfix is a code change deployed outside the normal release cycle to fix a critical issue. The runbook should document the exact steps for building, testing, and deploying a hotfix, including any steps that differ from the normal release process. In an emergency, you do not want someone guessing which CI checks can be skipped and which cannot.

# hotfix_procedure.md — excerpt from the runbook

## Hotfix Deployment Steps

1. Create branch from the current production tag:
   git checkout -b hotfix/ISSUE-ID production-v1.4.2

2. Apply the minimal fix. Do not refactor. Do not add features.
   - One commit, one fix, smallest possible diff

3. Run the fast-track CI suite (smoke tests + critical path only):
   make test-critical

4. Get approval from at least one other engineer (async OK):
   - Post the diff in #hotfix-review
   - Any engineer can approve with a thumbs-up

5. Deploy to staging, verify the fix, check for regressions:
   make deploy-staging
   - Spend 5 minutes manually testing the affected flow

6. Deploy to production:
   make deploy-production

7. Monitor crash rate and error logs for 15 minutes.
   - If crash rate increases, execute the rollback procedure

8. Post resolution message in player-facing channels.

Notice that the hotfix process is explicitly simpler than the normal release process. The full test suite might take 45 minutes; the fast-track suite takes 5. The normal process requires two approvals; the hotfix requires one. These shortcuts exist because speed matters during an incident, but they are documented so that everyone agrees on what is being skipped and why. Undocumented shortcuts made under pressure become undocumented technical debt.

Server Restart and Rollback Procedures

Document the exact commands for restarting each service your game depends on: game servers, API servers, matchmaking, database, cache, CDN. Include the expected downtime for each restart, any player-visible effects (will players be disconnected? will sessions be lost?), and the order in which services should be restarted if dependencies exist.
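The restart order falls out naturally if you record each service's dependencies and sort them, rather than memorizing the sequence. A sketch assuming a hypothetical service graph; the real service names and their restart commands belong in your runbook:

```python
# Hypothetical dependency graph: each service lists what must be up first.
DEPENDS_ON = {
    "cache": [],
    "database": [],
    "api": ["database", "cache"],
    "matchmaking": ["api"],
    "game-servers": ["api", "matchmaking"],
}


def restart_order(services: dict) -> list:
    """Topologically sort services so dependencies restart before dependents."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in services[name]:
            visit(dep)       # restart what this service needs first
        order.append(name)

    for name in services:
        visit(name)
    return order
```

Printing this order into the runbook (and regenerating it when the architecture changes) keeps the documented sequence honest.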

The rollback procedure is the escape hatch when a hotfix makes things worse. It should be a single command or a very short sequence of commands that reverts to the previous known-good version. Document the command, the expected behavior during rollback, and how to verify that the rollback was successful. Test the rollback procedure before you need it — deploy to staging, roll back, and verify that the previous version is running correctly. An untested rollback is not a rollback; it is a hope.
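The verification step can itself be scripted rather than eyeballed. A sketch assuming your deploy exposes the running version string somewhere queryable; `fetch_version` is a placeholder for however you read it, and the tag format follows the hotfix excerpt above:

```python
def verify_rollback(fetch_version, known_good: str, attempts: int = 3) -> bool:
    """Poll the deployed version until it reports the known-good tag.

    fetch_version: zero-argument callable returning the currently
    deployed version string (placeholder for your real health check).
    """
    for _ in range(attempts):
        if fetch_version().strip() == known_good:
            return True
    return False
```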

Communication Templates

Player communication during an incident is as important as the technical fix. A silent outage breeds conspiracy theories and rage. A transparent, timely update builds trust even when the news is bad. Pre-write templates for every stage of an incident so the community manager (or the on-call engineer, if you do not have a community manager) can post within minutes instead of agonizing over wording.

Write four templates. The initial acknowledgment: “We are aware of an issue affecting [service]. We are investigating and will provide an update within [timeframe].” The status update: “We have identified the cause of [issue] and are deploying a fix. Estimated resolution: [timeframe].” The resolution confirmation: “The issue affecting [service] has been resolved. [Brief explanation of what happened and what was fixed].” The post-incident summary: a longer message posted the next day with a timeline, root cause, and what you are doing to prevent recurrence.
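The first three templates are short enough to keep as format strings next to your incident tooling; the post-incident summary stays free-form. A sketch using Python's `str.format`, with wording lifted from the templates above:

```python
# The three short templates as format strings; the post-incident
# summary is written fresh each time and is not templated.
TEMPLATES = {
    "ack": ("We are aware of an issue affecting {service}. "
            "We are investigating and will provide an update within {timeframe}."),
    "update": ("We have identified the cause of {issue} and are deploying a fix. "
               "Estimated resolution: {timeframe}."),
    "resolved": "The issue affecting {service} has been resolved. {explanation}",
}


def render(stage: str, **fields) -> str:
    """Fill a template; raises KeyError if a placeholder is left empty."""
    return TEMPLATES[stage].format(**fields)
```

The deliberate KeyError on a missing field is a feature: a half-filled template with a literal `[timeframe]` in it should never reach players.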

Adapt the templates for each channel: shorter for Twitter/X, more detailed for Discord, in-game notifications for the most critical issues. Include links to your status page if you have one. The goal is to communicate early, communicate often, and communicate honestly. Players forgive outages; they do not forgive silence.

“A runbook is not a plan for what will go wrong. It is a plan for how you will respond when something goes wrong. The former is impossible to predict completely. The latter is entirely within your control.”

Keeping the Runbook Alive

A runbook that was written at launch and never updated is worse than no runbook at all, because it gives false confidence. After every incident, review the runbook and update it. Did the escalation path work? Was the hotfix procedure missing a step? Did the communication template need a different tone? Add a runbook review to your post-incident process.

Store the runbook where the team can find it instantly under pressure: a pinned message in your incident channel, a bookmarked wiki page, a printed copy on the office wall. Do not bury it in a Confluence space that requires three clicks and a search to reach. The on-call person at 2 AM should be able to open the runbook in under ten seconds.

Run a tabletop exercise once a quarter. Pick a hypothetical scenario (“The database is corrupted and matchmaking is down”), walk through the runbook, and see if the steps still make sense. This practice costs one hour per quarter and saves dozens of hours over the year. The team that rehearses its incident response is the team that resolves incidents calmly. The team that does not is the team that panics.

Related Issues

For handling the flood of player reports that accompany live incidents, see how to handle player reports during live events. For writing the hotfix patch notes that follow an incident, read how to write hotfix patch notes that build player trust.

The best runbook is the one you wrote last month, tested last week, and opened thirty seconds after the alert fired tonight.