Why pair crashes with commits?

Pairing answers three questions at once: did this commit actually fix the crash, did it introduce a new crash, and how long did it take you to ship a fix. Without that pairing you are guessing, and your postmortems get very vague.

How do you know which commit fixed a crash?

Compare the crash-signature rate in the build before and the build after each commit. If the signature’s rate drops by more than a threshold and stays down for a stable window of sessions, flag that commit as the probable fix. Let engineers confirm or reject the match.

What is a confidence score for a fix?

A confidence score combines the magnitude of the rate drop, the number of sessions observed after the commit, and whether the commit touched code near the top of the stack. Scores above 0.8 are usually reliable; scores between 0.5 and 0.8 need an engineer to confirm them, and anything below 0.5 is only recorded as a candidate.

How to Build a Crash-to-Fix Pairing System

Quick answer: Fingerprint every crash, track its rate per build, and flag commits that cause a sustained drop as candidate fixes. Score each candidate by rate magnitude, sample size, and code proximity, then let engineers confirm the match. The output is a living ledger of which commit fixed which crash, with before/after counts baked in.

Most studios know when a crash is fixed only because someone remembered to close the ticket. A crash-to-fix pairing system makes that link mechanical. For every crash signature in your backlog, it names the commit that most likely fixed it, reports the rate drop, and produces a confidence score. That ledger is the single source of truth for “did we actually ship the fix?” and “how fast do we close crashes now vs. six months ago?” It also catches regressions, because the same machinery flags commits that raise a signature back above zero.

Fingerprint First, Everything Else Second

A pairing system is only as good as its grouping. If the same bug produces five different signatures across builds, you cannot track its rate over time. Start with a stable fingerprint: normalize module offsets, strip inlined frames that vary by compiler version, canonicalize template names, and hash the top three to five frames. Keep the hash short (an 8-hex prefix is plenty) and store the full normalized frame list alongside it for debugging.
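The normalization steps above can be sketched as follows. This is a minimal illustration, not a production grouper: the frame format (`module!Function+0xOFFSET`), the `[inlined]` marker, and the function names are assumptions made for the example, and the template collapsing is deliberately crude (it would mangle frames such as `operator<`).

```python
import hashlib
import re


def normalize_frame(frame: str) -> str:
    """Reduce one raw stack frame to a form that is stable across builds."""
    frame = re.sub(r"\+0x[0-9a-fA-F]+", "", frame)  # drop function offsets
    frame = re.sub(r"0x[0-9a-fA-F]+", "", frame)    # drop raw addresses
    # Collapse template arguments from the inside out:
    # Pool<Mesh, Alloc<Mesh>> becomes Pool<>
    previous = None
    while previous != frame:
        previous = frame
        frame = re.sub(r"<[^<>]*>", "\x00", frame)
    return frame.replace("\x00", "<>").strip()


def fingerprint(frames: list[str], depth: int = 5) -> tuple[str, list[str]]:
    """Return (8-hex fingerprint, normalized top frames kept for debugging)."""
    # Assumption: the symbolizer tags inlined frames with "[inlined]".
    kept = [normalize_frame(f) for f in frames if "[inlined]" not in f]
    top = kept[:depth]
    digest = hashlib.sha1("\n".join(top).encode("utf-8")).hexdigest()
    return digest[:8], top
```

Two stacks that differ only in offsets, inlined helpers, or template arguments then land on the same fingerprint, and the second return value is the normalized frame list to store next to the hash.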

Guard against fingerprint churn by asserting that the hash function is reproducible. Check in a set of sample stacks and their expected fingerprints as a test, and fail CI if anyone changes the hashing in a way that would regroup live data.
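One way to write that guard is a small helper that replays checked-in sample stacks through whatever fingerprint function you use and reports any that no longer hash to their recorded value. The helper name and the shape of the golden data (expected hash mapped to frame list) are assumptions for this sketch.

```python
from typing import Callable, Sequence


def fingerprint_drift(
    fingerprint_fn: Callable[[Sequence[str]], str],
    golden: dict[str, Sequence[str]],
) -> list[str]:
    """Return the expected hashes whose sample stack no longer maps to them."""
    return [
        expected
        for expected, frames in golden.items()
        if fingerprint_fn(frames) != expected
    ]
```

A CI test would then assert that `fingerprint_drift(your_fingerprint, GOLDEN_SAMPLES)` is an empty list. A non-empty result names exactly which live groups a hashing change would split or merge.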

Emit Rates, Not Counts

Raw crash counts lie. By count, a build that ran on ten sessions and crashed twice looks healthier than a build that ran on a million sessions and crashed a thousand times, yet its crash rate is 20% against 0.1%. Always divide by exposure. Record crashes-per-session per signature per build, and only compare rates between builds with comparable, adequately large sample sizes. A 10x drop based on 40 sessions is noise; a 2x drop based on 400,000 sessions is a real result.

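-- Exposure counts every session on the build, not only the ones that crashed.
-- Assumes a sessions table (session_id, build_id, started_at); adapt to your schema.
SELECT c.build_id, c.signature,
       COUNT(*) AS crashes,
       s.sessions,
       COUNT(*) * 1.0 / s.sessions AS rate
FROM crash_events c
JOIN (SELECT build_id, COUNT(*) AS sessions
      FROM sessions
      WHERE started_at > NOW() - INTERVAL '30 days'
      GROUP BY build_id) s ON s.build_id = c.build_id
WHERE c.ingested_at > NOW() - INTERVAL '30 days'
GROUP BY c.build_id, c.signature, s.sessions;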
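To make "noise versus real result" concrete, one option is to put a confidence interval around each rate and only trust a drop when the intervals do not overlap. The sketch below uses the Wilson score interval, which is my choice for the illustration rather than something the pipeline requires. It treats the rate as the fraction of sessions with at least one crash, which is slightly different from crashes-per-session.

```python
import math


def wilson_interval(crashed: int, sessions: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for the crashed-session fraction."""
    if sessions == 0:
        return 0.0, 1.0
    p = crashed / sessions
    denom = 1 + z * z / sessions
    centre = (p + z * z / (2 * sessions)) / denom
    half = z * math.sqrt(p * (1 - p) / sessions + z * z / (4 * sessions ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def drop_is_real(before: tuple[int, int], after: tuple[int, int]) -> bool:
    """True when the after-interval lies entirely below the before-interval.

    Each argument is a (crashed_sessions, total_sessions) pair.
    """
    lo_before, _ = wilson_interval(*before)
    _, hi_after = wilson_interval(*after)
    return hi_after < lo_before


print(drop_is_real((2, 40), (0, 40)))                    # False: too few sessions
print(drop_is_real((8_000, 400_000), (4_000, 400_000)))  # True: 2% to 1% at scale
```

Non-overlapping intervals is a conservative test, so it will occasionally reject a genuine drop. That is usually the right trade-off when the result may close a bug automatically.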

Candidate Commits: The First Pass

Once you have rate-per-signature-per-build, pair each signature’s builds in order. Whenever the rate between build N and build N+1 drops by more than a threshold (I use 75%) or falls to zero, every commit in the range (N, N+1] becomes a candidate fix. For most studios that is between 5 and 200 commits, which is too many to review by hand. The next step prunes.
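A minimal sketch of that first pass, assuming the per-build rates for one signature are already loaded in build order. The `BuildRate` type and `find_fix_windows` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BuildRate:
    build_id: str
    rate: float      # crashes per session for one signature
    sessions: int


def find_fix_windows(history: list[BuildRate], threshold: float = 0.75):
    """Yield (before, after) pairs of consecutive builds where the rate fell
    by more than `threshold` or reached zero. `history` must be in build order."""
    for before, after in zip(history, history[1:]):
        if before.rate <= 0:
            continue  # the signature was not crashing, so there is nothing to fix
        drop = (before.rate - after.rate) / before.rate
        if after.rate == 0 or drop > threshold:
            yield before, after
```

Each yielded pair defines one commit range. If your builds map to tags or SHAs, `git log <build N>..<build N+1>` lists the commits in that half-open range.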

Prune by file proximity. Parse each candidate commit’s diff and extract the list of changed source files. Parse the crash stack’s top frames and extract their source files from your symbol server. Keep candidates whose changed files overlap with the stack’s source files. This typically drops 200 candidates to 2 or 3.
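The pruning step might look like the sketch below. It assumes you have already extracted each candidate commit's changed files and the stack's source files as paths relative to the repository root; mapping absolute symbol-server paths back to repo paths is left out. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def _norm(path: str) -> str:
    """Normalize separators and case so repo paths and symbol paths compare."""
    return str(PurePosixPath(path.replace("\\", "/"))).lower()


def file_overlap(changed_files, stack_files) -> float:
    """Fraction of the stack's source files that the commit touched (0.0 to 1.0)."""
    stack = {_norm(p) for p in stack_files}
    if not stack:
        return 0.0
    changed = {_norm(p) for p in changed_files}
    return len(stack & changed) / len(stack)


def prune_candidates(candidates: dict[str, list[str]], stack_files: list[str]) -> list[str]:
    """Keep the commit SHAs whose changed files overlap the stack's source files."""
    return [
        sha for sha, files in candidates.items()
        if file_overlap(files, stack_files) > 0
    ]
```

File-level overlap is coarse. If your symbol data includes line ranges, matching on the changed function gives a sharper signal, at the cost of parsing diffs in more detail.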

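def score_candidate(commit, signature, before, after):
    # Fraction of the pre-commit rate that disappeared, clamped to [0, 1]
    # so a rate increase cannot push the score negative.
    magnitude = (before.rate - after.rate) / max(before.rate, 1e-6)
    magnitude = max(0.0, min(1.0, magnitude))
    # Sample weight saturates at 100,000 post-commit sessions.
    sample = min(after.sessions, 100_000) / 100_000
    # stack_overlap is assumed to return a proximity value in [0, 1].
    proximity = stack_overlap(commit.files, signature.top_frames)
    confidence = 0.5 * magnitude + 0.3 * sample + 0.2 * proximity
    return round(confidence, 2)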

Confidence Scoring

For each surviving candidate, produce a score between 0 and 1. Three inputs matter:

Magnitude is how much the rate dropped. A signature that went from 2% of sessions to 0% scores full magnitude. One that went from 2% to 1.5% scores partial.

Sample size is how many sessions you have observed after the candidate commit shipped. With fewer than a thousand sessions, the drop could be random variation. Above a hundred thousand, sampling noise is negligible for all but the rarest signatures.

Proximity is whether the commit touched code near the stack. A commit that edited the exact function at the top of the crash stack scores 1.0. A commit in an unrelated system scores near zero. This input is the main signal that separates the actual fix from all the coincidental commits in the same release window.

Scores above 0.8 are reliable enough to auto-assign and close the bug. Between 0.5 and 0.8, surface the pairing to the engineer who owns the file and let them confirm. Below 0.5, record the candidate but do not close anything.
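Those bands translate directly into a small routing function. The action names are placeholders for whatever your tracker integration expects.

```python
def route_pairing(confidence: float) -> str:
    """Map a confidence score to the action described above."""
    if confidence > 0.8:
        return "auto_close"          # assign the commit and close the bug
    if confidence >= 0.5:
        return "needs_confirmation"  # surface to the engineer who owns the file
    return "record_only"             # keep the candidate, close nothing
```

As a worked example with the weights 0.5, 0.3, and 0.2: a signature that dropped to zero (magnitude 1.0), observed over 50,000 sessions (sample 0.5), with a commit that edited the top frame's function (proximity 1.0), scores 0.5 + 0.15 + 0.2 = 0.85 and is auto-closed.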

Regression Detection Uses the Same Pipeline

Flip the sign on the magnitude term and you have a regression detector. Any commit that raises a previously-zero signature back above a threshold is a regression candidate. Same proximity math, same sample-size math, same confidence score. The only difference is the alert channel: regressions get paged; fixes go to the weekly fix-velocity report.
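A sketch of the flipped score, under two assumptions of mine: the magnitude is measured relative to the new rate, so a signature returning from zero does not divide by zero, and the alert threshold defaults to 0.1% of sessions. Both are illustrative choices, as is the function name.

```python
def regression_confidence(
    before_rate: float,
    after_rate: float,
    after_sessions: int,
    proximity: float,
    alert_threshold: float = 0.001,
) -> float:
    """Mirror of the fix score: how likely a commit reintroduced a signature."""
    if after_rate < alert_threshold or after_rate <= before_rate:
        return 0.0  # below the alert threshold, or the rate did not rise
    # Share of the new rate that is new; 1.0 when the signature was at zero.
    magnitude = (after_rate - before_rate) / after_rate
    sample = min(after_sessions, 100_000) / 100_000
    return round(0.5 * magnitude + 0.3 * sample + 0.2 * proximity, 2)
```

The sample and proximity terms are unchanged from the fix score, so the same confidence bands can be reused for deciding when to page.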

Regressions tend to come from two places: reverts that undo a previous fix, and refactors that touch the same code path. Tag each regression with whether the candidate commit is a revert (check the commit message for the “This reverts commit <sha>” line that git revert writes by default) so you can route reverts directly to whoever originally fixed the bug.
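Revert tagging can be as simple as a regular expression over the commit message. The sketch assumes the default message that `git revert` generates has not been edited away, and that `fix_authors` is a hypothetical lookup from full fix-commit SHA to author taken from your pairing ledger.

```python
import re
from typing import Optional

# git revert writes "This reverts commit <sha>." into the message body by default.
_REVERT_RE = re.compile(r"^This reverts commit ([0-9a-f]{7,40})\b", re.MULTILINE)


def reverted_sha(commit_message: str) -> Optional[str]:
    """Return the SHA a commit claims to revert, or None if it is not a revert."""
    match = _REVERT_RE.search(commit_message)
    return match.group(1) if match else None


def revert_owner(commit_message: str, fix_authors: dict[str, str]) -> Optional[str]:
    """Return the author of the fix being reverted, if the ledger knows it."""
    sha = reverted_sha(commit_message)
    return fix_authors.get(sha) if sha else None
```

Squash merges and hand-written revert messages will slip past this check, so treat a miss as "unknown" rather than "not a revert".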

Fix Velocity as a Team Metric

Once the pairing table exists, fix velocity falls out for free. For every signature with a confirmed fix, you know the time between first occurrence and fix-shipped-in-production. Aggregate by month and you can answer questions your studio lead actually cares about: are we getting faster at closing crashes, which engineer closes them fastest, and which subsystem takes the longest to repair.

Do not turn these numbers into individual performance metrics. Fix velocity is a team signal; it gets worse when you chase it as a leaderboard. Use it to spot subsystems with structural problems (renderer crashes taking 30 days while gameplay crashes close in 2) and invest there.
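A minimal aggregation, assuming each confirmed pairing carries a first-seen timestamp, a fix-shipped timestamp, and a subsystem label. The tuple layout and function name are hypothetical. It groups by subsystem rather than by engineer, in keeping with the advice above.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def fix_velocity_by_month(pairs):
    """pairs: iterable of (first_seen, fix_shipped, subsystem) tuples.

    Returns {(YYYY-MM, subsystem): median days from first occurrence to fix}.
    """
    buckets = defaultdict(list)
    for first_seen, fix_shipped, subsystem in pairs:
        days = (fix_shipped - first_seen).total_seconds() / 86400
        buckets[(fix_shipped.strftime("%Y-%m"), subsystem)].append(days)
    return {key: median(values) for key, values in buckets.items()}
```

The median is used instead of the mean so that one crash that lingered for a year does not hide an otherwise improving trend.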

Display the Pairing in the Bug Ticket

The final piece is surfacing the pairing where people already look. Every crash ticket should show a “Probable fix” block with the commit SHA, author, build it shipped in, before/after rate, and confidence score. If the engineer disagrees, one click clears the match and the signature goes back into the pool of unresolved crashes.
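As a rough sketch of that block, the snippet below renders the listed fields as plain text for a ticket comment. The `Pairing` type, field names, and layout are illustrative; a real integration would use your tracker's own formatting.

```python
from dataclasses import dataclass


@dataclass
class Pairing:
    sha: str
    author: str
    build: str
    rate_before: float  # fraction of sessions, e.g. 0.02 for 2%
    rate_after: float
    confidence: float


def probable_fix_block(p: Pairing) -> str:
    """Render the "Probable fix" summary shown on a crash ticket."""
    return "\n".join([
        "Probable fix",
        f"  Commit:     {p.sha[:10]} ({p.author})",
        f"  Shipped in: {p.build}",
        f"  Rate:       {p.rate_before:.2%} -> {p.rate_after:.2%}",
        f"  Confidence: {p.confidence:.2f}",
    ])
```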

“Before we built this, our retro always argued about whether last sprint’s hotfix actually stuck. Now the confidence score is right on the ticket and the argument is over before it starts.”

Related Issues

For background on stack grouping, read how to build a crash report deduplication system. For a deeper look at regression tracking, see how to track and reduce crash rate over releases.

If you cannot name the commit that fixed a crash, you cannot tell whether it is actually fixed.