Quick answer: Weight four signals into a single composite: crash-free rate (40%), smoke test pass rate (20%), normalized bug inflow (20%), and p95 frame time regression (20%). Score out of 100, review weekly, treat >5-point drops as incidents.

Every week your team ships builds. Some are stable; some aren’t. Without a single number, you argue about whether “things are getting worse.” With a composite stability score, the trend line speaks for itself. The score becomes the shared language of release readiness.

The Four Signals

1. Crash-free session rate (40%). Percent of sessions that ended without a crash. Most important single metric; any game should target > 99.5%.

2. Smoke test pass rate (20%). Percent of CI smoke tests passing on the latest build. A failing smoke test is a bug QA hasn’t seen yet.

3. Normalized bug inflow (20%). New bugs filed per 1000 MAU per week, inverted and scaled so that lower inflow scores higher. High inflow means players are hitting issues.

4. P95 frame time regression (20%). Current p95 frame time compared with a baseline, in milliseconds. Larger regressions score lower.
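The first three signals are simple ratios over weekly counts. A minimal sketch of how they might be derived is below; the WeeklyCounts fields and the raw_signals helper are hypothetical names for illustration, not part of any particular analytics tool.

```python
from dataclasses import dataclass


@dataclass
class WeeklyCounts:
    # Hypothetical raw counts pulled from analytics, CI, and the bug tracker.
    sessions: int
    crashed_sessions: int
    smoke_tests_run: int
    smoke_tests_passed: int
    new_bugs: int
    monthly_active_users: int


def raw_signals(counts: WeeklyCounts) -> dict:
    """Turn weekly counts into the rates the score is built from."""
    if min(counts.sessions, counts.smoke_tests_run, counts.monthly_active_users) <= 0:
        raise ValueError("sessions, smoke_tests_run and monthly_active_users must be positive")
    return {
        # Fraction of sessions that ended without a crash (0-1).
        "crash_free_rate": 1 - counts.crashed_sessions / counts.sessions,
        # Fraction of smoke tests passing on the latest build (0-1).
        "smoke_pass_rate": counts.smoke_tests_passed / counts.smoke_tests_run,
        # New bugs this week per 1000 monthly active users.
        "new_bugs_per_1k_mau": counts.new_bugs / (counts.monthly_active_users / 1000),
    }


week = WeeklyCounts(
    sessions=200_000,
    crashed_sessions=1_000,
    smoke_tests_run=200,
    smoke_tests_passed=190,
    new_bugs=120,
    monthly_active_users=50_000,
)
signals = raw_signals(week)
```

With these made-up counts the week comes out at a 99.5% crash-free rate, a 95% smoke pass rate, and 2.4 new bugs per 1000 MAU.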

Computing the Score

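def clamp(value, low=0.0, high=100.0):
    # Keep every component inside the 0-100 range.
    return max(low, min(high, value))

def stability_score(metrics):
    # Crash-free rate: 0.98 -> 0, 1.0 -> 100; anything below 0.98 scores 0.
    crash = clamp((metrics.crash_free_rate - 0.98) * 5000)
    smoke = clamp(metrics.smoke_pass_rate * 100)  # pass rate is a 0-1 fraction
    bugs = clamp(100 - metrics.new_bugs_per_1k_mau * 10)  # 10+ bugs per 1k MAU -> 0
    # A negative regression (an improvement) is capped at 100; a 50 ms regression -> 0.
    perf = clamp(100 - metrics.p95_regression_ms * 2)

    return 0.4 * crash + 0.2 * smoke + 0.2 * bugs + 0.2 * perf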

The output ranges from 0 to 100: 90 and above is green, 75 up to (but not including) 90 is yellow, and under 75 is red. The weights reflect that crashes affect every player while frame time affects a specific segment.
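A worked example helps calibrate the bands. Suppose a week with a 99.5% crash-free rate, a 95% smoke pass rate, 2.4 new bugs per 1000 MAU, and a 3 ms p95 regression. Under the formulas above, those map to component scores of 75, 95, 76, and 94. The sketch below combines them and looks up the band; the numbers are made up, and the composite and band helpers are hypothetical names for illustration.

```python
# Weights from the formula above; the component scores are hypothetical.
WEIGHTS = {"crash": 0.4, "smoke": 0.2, "bugs": 0.2, "perf": 0.2}


def composite(components):
    """Weighted sum of the four 0-100 component scores."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def band(score):
    """Map a composite score to its color band (90+ green, 75 up to 90 yellow)."""
    if score >= 90:
        return "green"
    if score >= 75:
        return "yellow"
    return "red"


example = {"crash": 75.0, "smoke": 95.0, "bugs": 76.0, "perf": 94.0}
score = composite(example)
print(round(score, 1), band(score))
```

The composite is 30 + 19 + 15.2 + 18.8 = 83, which lands in yellow. Note that a crash-free rate sitting exactly on the 99.5% target contributes only 75 of a possible 100 on the crash component, so a green week needs the crash-free rate to be above that target.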

Tracking Weekly

Plot the score on a line chart with the four component bars stacked below. The composite tells you “how are we doing”; the components tell you “why.”

Review at the same meeting that plans the next milestone. Any drop > 5 points in a week triggers an incident review: what changed, what shipped, what’s the mitigation.
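The 5-point rule is easy to automate so nobody has to eyeball the chart. This is a minimal sketch that assumes the weekly scores are available as a plain list, oldest first; the function name is hypothetical.

```python
def weekly_incidents(scores, threshold=5.0):
    """Return (week_index, drop) pairs where the score fell by more than
    `threshold` points compared with the previous week."""
    incidents = []
    for week in range(1, len(scores)):
        drop = scores[week - 1] - scores[week]
        if drop > threshold:
            incidents.append((week, drop))
    return incidents


history = [88.0, 87.0, 81.0, 82.0, 77.0]
flagged = weekly_incidents(history)
```

In this made-up history only the 87 to 81 week is flagged. The later 82 to 77 drop is exactly 5 points, which does not exceed the threshold.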

Baselines and Gating

Pick a baseline week for p95 frame time (usually the last shipped release). Regressions are measured from there. Every major release resets the baseline.
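To make the regression concrete, here is one way it could be computed from raw frame time samples. This sketch assumes a nearest-rank percentile and samples in milliseconds; your telemetry pipeline may define p95 slightly differently, and the function names are hypothetical.

```python
def p95(frame_times_ms):
    """Nearest-rank 95th percentile of a list of frame times in milliseconds."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time sample")
    ordered = sorted(frame_times_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_regression_ms(current_samples, baseline_samples):
    """Positive when the current build is slower than the baseline."""
    return p95(current_samples) - p95(baseline_samples)


baseline = [float(ms) for ms in range(1, 101)]  # made-up samples: 1..100 ms
current = [ms + 3.0 for ms in baseline]         # every frame 3 ms slower
regression = p95_regression_ms(current, baseline)
```

Here the baseline p95 is 95 ms and the current p95 is 98 ms, so the regression fed into the score is 3 ms.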

For release gating, require a score ≥ 85 on the release branch for 3 consecutive days. This avoids shipping on a one-day spike that doesn’t reflect how stable the branch really is.
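The gate itself is a one-liner once daily scores exist. A minimal sketch, assuming one score per day for the release branch in chronological order (the function name is hypothetical):

```python
def release_gate_open(daily_scores, minimum=85.0, days=3):
    """True only if the most recent `days` daily scores are all at or above `minimum`."""
    if len(daily_scores) < days:
        return False  # not enough history to judge yet
    return all(score >= minimum for score in daily_scores[-days:])


ready = release_gate_open([80.0, 86.0, 85.0, 90.0])
blocked = release_gate_open([86.0, 90.0, 84.0])
```

The first branch passes because its last three days are 86, 85, and 90. The second is blocked by the single 84, and the 3-day count starts over from the next passing day.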

Avoiding Goodhart's Law

Optimizing for the score alone produces perverse incentives. Engineers start gaming smoke tests, suppressing bug reports, or pinning frame times. Review the signals behind the score, not just the number. If smoke tests all pass but players still report issues, your tests don’t cover the right things.

“A stability score you review every week is worth ten scores you look at after launch. It’s a leading indicator, not a postmortem metric.”

Related Issues

For game health dashboards that visualize the score, see how to set up a game health scorecard. For error budgets that complement stability scores, see how to set up error budgets for game stability.

The score is only as good as the data feeding it. Verify every signal monthly; stale metrics make the composite meaningless.