Quick answer: Track every test run in a database, calculate a flakiness score per test based on pass-fail transitions across consecutive runs, quarantine tests that cross a threshold out of the main pipeline, and fix the underlying nondeterminism. Never ship auto-retry as a permanent solution — it hides real regressions. The most common game-specific flake causes are shared scene state, unseeded randomness, and coroutine timing.
Flaky tests are the silent killer of CI pipelines. You start with a handful of unreliable tests. Your team starts seeing occasional red builds and shrugs them off. “Probably just that flaky test again, re-run it.” Eventually, when a real regression lands, nobody believes the failing test is real. They re-run it. They merge anyway. The regression ships. Here’s how to stop the spiral before it starts.
What Makes Tests Flaky in Game Projects
Games have specific sources of nondeterminism that other software doesn’t. Knowing these helps you diagnose faster:
- Frame timing: Tests that assert “after 2 seconds the thing should have moved 20 units” break on slow CI runners that produce different frame rates.
- Shared scene state: A test modifies a singleton, the next test finds the singleton in an unexpected state.
- Physics order: Physics step order can depend on insertion order, which depends on test execution order.
- Coroutines and async: A test that waits “until next frame” races with another coroutine.
- Random generators: RNG without a fixed seed produces different output per run.
- Asset loading: Addressables or async resource loads complete in non-deterministic order.
- Multithreading: Jobs or threads that write to shared state without proper synchronization.
- Wall clock: Tests that use DateTime.Now instead of injected time.
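That last cause has a standard fix worth sketching up front: route every time read through an injectable clock, so a test can advance time explicitly instead of sleeping. A minimal Python sketch, where Clock, FakeClock, and CooldownTimer are illustrative names rather than a real API:

```python
import time

class Clock:
    """Production clock: reads the real monotonic timer."""
    def now(self) -> float:
        return time.monotonic()

class FakeClock(Clock):
    """Test clock: time only moves when the test says so."""
    def __init__(self, start: float = 0.0):
        self._t = start
    def now(self) -> float:
        return self._t
    def advance(self, dt: float) -> None:
        self._t += dt

class CooldownTimer:
    """Example consumer: an ability that can fire once per cooldown period."""
    def __init__(self, clock: Clock, cooldown: float):
        self.clock = clock
        self.cooldown = cooldown
        self.last_fired = None
    def try_fire(self) -> bool:
        t = self.clock.now()
        if self.last_fired is None or t - self.last_fired >= self.cooldown:
            self.last_fired = t
            return True
        return False
```

A test constructs a FakeClock and calls advance(2.0) instead of sleeping for two seconds, so the result is identical on every runner regardless of machine speed.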
Step 1: Track Every Test Run
You can’t fix what you don’t measure. Store the result of every test run in a database, keyed by test name, with columns for: timestamp, commit SHA, branch, pass/fail, duration, CI runner ID, and failure message if failed.
After a few hundred runs, you have the data you need to detect flakiness. A stable test either passes consistently or fails consistently on a given commit. A flaky test shows pass-fail-pass-fail transitions within the same commit range.
-- SQL schema for tracking test runs
CREATE TABLE test_runs (
    id              BIGINT PRIMARY KEY AUTO_INCREMENT,
    run_id          VARCHAR(64) NOT NULL,
    test_name       VARCHAR(255) NOT NULL,
    commit_sha      VARCHAR(40) NOT NULL,
    branch          VARCHAR(255) NOT NULL,
    started_at      DATETIME NOT NULL,
    duration_ms     INT NOT NULL,
    result          ENUM('pass', 'fail', 'skip', 'error') NOT NULL,
    failure_message TEXT,
    runner          VARCHAR(64),
    INDEX idx_test_name (test_name),
    INDEX idx_commit (commit_sha),
    INDEX idx_started (started_at)
);
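Test reporters differ, but writing one row per run is straightforward from any language. As a sketch, here is the recording step using Python's built-in sqlite3, with the MySQL schema above translated to SQLite types (column names kept identical; point the connection at a real database in CI):

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE test_runs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id          TEXT NOT NULL,
    test_name       TEXT NOT NULL,
    commit_sha      TEXT NOT NULL,
    branch          TEXT NOT NULL,
    started_at      TEXT NOT NULL,
    duration_ms     INTEGER NOT NULL,
    result          TEXT NOT NULL CHECK (result IN ('pass', 'fail', 'skip', 'error')),
    failure_message TEXT,
    runner          TEXT
)""")

def record_run(run_id, test_name, commit_sha, branch, duration_ms, result,
               failure_message=None, runner=None):
    """Insert one test-run row, stamping started_at server-side."""
    conn.execute(
        "INSERT INTO test_runs (run_id, test_name, commit_sha, branch, started_at,"
        " duration_ms, result, failure_message, runner)"
        " VALUES (?, ?, ?, ?, datetime('now'), ?, ?, ?, ?)",
        (run_id, test_name, commit_sha, branch, duration_ms, result,
         failure_message, runner))
    conn.commit()

record_run("ci-123", "EnemyTakesDamage", "a" * 40, "main", 842, "pass",
           runner="runner-07")
```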
Step 2: Calculate a Flakiness Score
A useful flakiness score counts state transitions: how often the same test on the same commit flips between pass and fail. A purely deterministic test has 0 transitions per commit. A flaky test has 1 or more.
-- Find flaky tests: tests that both passed and failed on the same commit
SELECT
    test_name,
    SUM(CASE WHEN passes > 0 AND fails > 0 THEN 1 ELSE 0 END) AS commits_with_flips,
    SUM(runs)   AS total_runs,
    SUM(passes) AS passes,
    SUM(fails)  AS fails,
    SUM(fails) / SUM(runs) * 100 AS fail_rate_pct
FROM (
    SELECT
        test_name,
        commit_sha,
        COUNT(*) AS runs,
        SUM(CASE WHEN result = 'pass' THEN 1 ELSE 0 END) AS passes,
        SUM(CASE WHEN result = 'fail' THEN 1 ELSE 0 END) AS fails
    FROM test_runs
    WHERE started_at > DATE_SUB(NOW(), INTERVAL 14 DAY)
    GROUP BY test_name, commit_sha
) per_commit
GROUP BY test_name
HAVING commits_with_flips > (
    SELECT COUNT(DISTINCT commit_sha) / 20
    FROM test_runs
    WHERE started_at > DATE_SUB(NOW(), INTERVAL 14 DAY)
)
ORDER BY commits_with_flips DESC;
The subquery in the HAVING clause sets the threshold: a test counts as flaky if it flipped on more than 5% of the distinct commits seen in the last 14 days. Adjust this to your team's tolerance; stricter teams use 1% or 2%.
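The SQL detects commits where a test both passed and failed. If you want the stricter transition count from the definition above (how many times consecutive runs on the same commit flipped between pass and fail), it is easier to compute in application code. A sketch in Python, assuming rows shaped like the test_runs table:

```python
from collections import defaultdict

def flakiness_scores(runs):
    """runs: iterable of (test_name, commit_sha, started_at, result) tuples.
    Returns {test_name: pass<->fail transitions, summed over all commits}."""
    by_commit = defaultdict(list)
    for test_name, commit_sha, started_at, result in runs:
        by_commit[(test_name, commit_sha)].append((started_at, result))
    scores = defaultdict(int)
    for (test_name, _), entries in by_commit.items():
        entries.sort()  # chronological order within one commit
        outcomes = [r for _, r in entries if r in ("pass", "fail")]
        scores[test_name] += sum(
            1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return dict(scores)
```

A stable test scores 0 no matter how often it runs; a pass-fail-pass sequence on one commit scores 2.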
Step 3: Quarantine, Don’t Retry
When a test crosses the flakiness threshold, automatically move it to a quarantine suite. This suite still runs but does not block the build. Developers see the results but can merge even if quarantined tests fail.
Quarantine creates accountability. The test is still visible, still running, and someone owns fixing it — but it’s no longer blocking the whole team. Contrast with auto-retry, which hides the problem entirely and trains people to ignore failures.
# GitHub Actions workflow with separate stable and quarantined test jobs
jobs:
  stable-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run stable tests
        run: |
          cd api
          # Go's -run regex is RE2, which has no negative lookahead;
          # use -skip (Go 1.20+) to exclude the quarantined tests instead.
          go test ./cmd/server/ -skip '^TestFlaky' -v -timeout 5m
    # This job's result gates PR merges
  quarantined-tests:
    runs-on: ubuntu-latest
    continue-on-error: true  # does not block merge
    steps:
      - uses: actions/checkout@v4
      - name: Run quarantined tests
        run: |
          cd api
          go test ./cmd/server/ -run '^TestFlaky' -v -timeout 5m
      - name: Notify test owners of failures
        if: failure()
        run: ./scripts/notify-quarantine-owners.sh
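One way to keep the two jobs in sync is to generate both test filters from a single quarantine list, rather than renaming tests to a TestFlaky* convention by hand. A sketch in Python (the function and its wiring into CI are illustrative; the output fits go test's -run and -skip flags, which take RE2 regexes):

```python
import re

def go_test_patterns(quarantined):
    """Build one anchored regex, usable as -skip in the stable job and
    as -run in the quarantine job, from a list of quarantined test names."""
    if not quarantined:
        # Nothing quarantined: stable job runs everything, quarantine job nothing.
        return {"stable_skip": None, "quarantine_run": None}
    pattern = "^(" + "|".join(re.escape(name) for name in quarantined) + ")$"
    return {"stable_skip": pattern, "quarantine_run": pattern}
```

The stable job then runs go test -skip "$PATTERN" and the quarantine job go test -run "$PATTERN", so moving a test between suites means editing one list.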
Step 4: Fix the Root Cause
Every quarantined test needs an owner and a fix deadline. Common fixes by cause:
Shared state: Add explicit teardown that resets singletons, clears scene state, and releases resources. Better: use dependency injection so tests own their state.
// Good: explicit teardown
[TearDown]
public void ClearState()
{
    GameManager.Reset();             // project-specific singleton reset
    SceneManager.UnloadAllScenes();  // project helper; Unity's own API only offers UnloadSceneAsync
    Resources.UnloadUnusedAssets();  // release assets leaked by the test
}

// Better: no shared state to begin with
[Test]
public void EnemyTakesDamage()
{
    var gm = new GameManager(); // fresh instance, no global to pollute
    var enemy = new Enemy(gm, health: 100);
    enemy.TakeDamage(30);
    Assert.AreEqual(70, enemy.Health);
}
Timing dependencies: Replace wall-clock waits with event-based waits. Instead of yield return new WaitForSeconds(2), wait for a specific signal like yield return new WaitUntil(() => loader.IsComplete).
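Outside Unity the same principle holds: block on the completion signal itself, with a generous timeout as a safety net rather than as the clock. A minimal Python sketch, where Loader is an illustrative stand-in for any async resource load:

```python
import threading
import time

class Loader:
    """Illustrative async loader that signals completion via an Event."""
    def __init__(self):
        self.done = threading.Event()
    def load_async(self):
        def work():
            time.sleep(0.05)  # stand-in for variable-duration loading work
            self.done.set()
        threading.Thread(target=work, daemon=True).start()

loader = Loader()
loader.load_async()
# Flaky version: time.sleep(0.1) then assert -- breaks whenever the runner is slow.
# Stable version: wait on the signal; the timeout is a safety net, not the clock.
assert loader.done.wait(timeout=5.0)
```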
Random generators: Seed them explicitly in test setup. Every test that uses randomness should set the seed to a known value:
[SetUp]
public void SeedRandom()
{
    UnityEngine.Random.InitState(12345);
    // Or, if using System.Random, hold a seeded instance per test:
    _rng = new System.Random(12345);
}
Order-dependent tests: Run tests in random order in CI. Most test frameworks support this. If tests only pass in a specific order, you have shared state that needs fixing.
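If your framework lacks a built-in random-order mode, you can approximate one by shuffling with a logged seed, so a failing order can be replayed exactly (sketch; names are illustrative):

```python
import random

def shuffled_test_order(test_names, seed=None):
    """Shuffle execution order; return the seed so CI can log it and a
    developer can reproduce the exact failing order locally."""
    if seed is None:
        seed = random.randrange(2**32)
    order = list(test_names)
    random.Random(seed).shuffle(order)
    return seed, order
```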
Async / coroutines: Use explicit completion signals instead of frame counts. If a test needs to wait for something to load, wait for an event or callback, not “3 frames.”
Dashboarding Flaky Tests
Build a dashboard that shows the top 20 flakiest tests, their owners, and the trend over time. Review it weekly. Celebrate when a flaky test gets fixed and removed from the list. Publicly tracking this metric creates a cultural norm that flakiness is not acceptable.
Also track the overall CI pass rate on the main branch. If it drops below 99% for reasons other than real regressions, investigate. A single high-volume flaky test can drag the metric down and mask real issues.
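The pass-rate metric falls out of the same run records. A sketch (skips and errors are excluded from the denominator here, which is one reasonable choice among several):

```python
def branch_pass_rate(runs, branch="main"):
    """runs: iterable of (branch, result) pairs from the test_runs table.
    Returns pass percentage among pass/fail runs on the branch, or None."""
    outcomes = [result for b, result in runs
                if b == branch and result in ("pass", "fail")]
    if not outcomes:
        return None
    return 100.0 * outcomes.count("pass") / len(outcomes)
```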
Related Issues
For broader CI setup, see How to Set Up Automated Build Testing for Games. For regression test strategies, see How to Write Regression Tests for Game Bugs. For smoke tests specifically, check How to Set Up Smoke Tests for Game Builds.
A flaky test is lying to you. Don’t retry — listen, then fix.