Quick answer: Track every test run in a database, calculate a flakiness score per test based on pass-fail transitions across consecutive runs, quarantine tests that cross a threshold out of the main pipeline, and fix the underlying nondeterminism. Never ship auto-retry as a permanent solution — it hides real regressions. The most common game-specific flake causes are shared scene state, unseeded randomness, and coroutine timing.

Flaky tests are the silent killer of CI pipelines. You start with a handful of unreliable tests. Your team starts seeing occasional red builds and shrugs them off. “Probably just that flaky test again, re-run it.” Eventually, when a real regression lands, nobody believes the failing test is real. They re-run it. They merge anyway. The regression ships. Here’s how to stop the spiral before it starts.

What Makes Tests Flaky in Game Projects

Games have specific sources of nondeterminism that most other software doesn't. Knowing these helps you diagnose faster:

Shared scene state: singletons, static managers, and scene objects that leak from one test into the next.

Timing dependencies: wall-clock waits and frame-count assumptions that behave differently on loaded CI runners.

Unseeded randomness: procedural generation, loot rolls, and AI decisions that differ on every run.

Order dependence: tests that only pass when run in a particular sequence.

Async and coroutines: loads and callbacks that complete after a variable number of frames.

Step 1: Track Every Test Run

You can’t fix what you don’t measure. Store the result of every test run in a database, keyed by test name, with columns for: timestamp, commit SHA, branch, pass/fail, duration, CI runner ID, and failure message if failed.

After a few hundred runs, you have the data you need to detect flakiness. A stable test either passes consistently or fails consistently on a given commit. A flaky test shows pass-fail-pass-fail transitions within the same commit range.

-- SQL schema for tracking test runs
CREATE TABLE test_runs (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    run_id VARCHAR(64) NOT NULL,
    test_name VARCHAR(255) NOT NULL,
    commit_sha VARCHAR(40) NOT NULL,
    branch VARCHAR(255) NOT NULL,
    started_at DATETIME NOT NULL,
    duration_ms INT NOT NULL,
    result ENUM('pass', 'fail', 'skip', 'error') NOT NULL,
    failure_message TEXT,
    runner VARCHAR(64),

    INDEX idx_test_name (test_name),
    INDEX idx_commit (commit_sha),
    INDEX idx_started (started_at)
);
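Feeding this table is a small wrapper around your test runner. A minimal Python sketch of the recording step, using SQLite as a stand-in for the MySQL schema above (ENUM and AUTO_INCREMENT adjusted); `record_run` is a hypothetical helper your CI wrapper would call once per test result:

```python
import sqlite3

# SQLite equivalent of the MySQL schema above (types simplified).
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id TEXT NOT NULL,
    test_name TEXT NOT NULL,
    commit_sha TEXT NOT NULL,
    branch TEXT NOT NULL,
    started_at TEXT NOT NULL,
    duration_ms INTEGER NOT NULL,
    result TEXT NOT NULL CHECK (result IN ('pass', 'fail', 'skip', 'error')),
    failure_message TEXT,
    runner TEXT
)
"""

def record_run(conn, run_id, test_name, commit_sha, branch,
               started_at, duration_ms, result,
               failure_message=None, runner=None):
    """Insert one test result row; called once per test per CI run."""
    conn.execute(
        "INSERT INTO test_runs (run_id, test_name, commit_sha, branch, "
        "started_at, duration_ms, result, failure_message, runner) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (run_id, test_name, commit_sha, branch, started_at,
         duration_ms, result, failure_message, runner),
    )
    conn.commit()
```

In practice this runs as a post-step in CI, parsing the runner's JUnit or JSON output and inserting one row per test.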

Step 2: Calculate a Flakiness Score

A useful flakiness score counts state transitions: how often the same test on the same commit flips between pass and fail. A purely deterministic test has 0 transitions per commit. A flaky test has 1 or more.
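The transition count is easy to compute from a chronological list of one test's results on one commit. A minimal Python sketch (the name `transition_count` is ours, not from any framework):

```python
def transition_count(results):
    """Count pass/fail flips across consecutive runs of one test on one commit.

    `results` is a chronological list like ['pass', 'fail', 'pass'].
    Skips and errors are ignored so they don't count as flips.
    A deterministic test scores 0; anything >= 1 is a flake candidate.
    """
    relevant = [r for r in results if r in ("pass", "fail")]
    return sum(1 for a, b in zip(relevant, relevant[1:]) if a != b)
```

For example, `['pass', 'fail', 'pass', 'fail']` scores 3 transitions, while five straight passes score 0.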

-- Find flaky tests: tests that both passed and failed on the same commit
SELECT
    test_name,
    SUM(flipped) AS commits_with_flips,
    SUM(runs) AS total_runs,
    SUM(passes) AS passes,
    SUM(fails) AS fails,
    SUM(fails) / SUM(runs) * 100 AS fail_rate_pct
FROM (
    -- One row per (test, commit); flipped = 1 when that commit saw both results
    SELECT
        test_name,
        commit_sha,
        COUNT(*) AS runs,
        SUM(result = 'pass') AS passes,
        SUM(result = 'fail') AS fails,
        (SUM(result = 'pass') > 0 AND SUM(result = 'fail') > 0) AS flipped
    FROM test_runs
    WHERE started_at > DATE_SUB(NOW(), INTERVAL 14 DAY)
    GROUP BY test_name, commit_sha
) AS per_commit
GROUP BY test_name
HAVING commits_with_flips > (
    SELECT COUNT(DISTINCT commit_sha) / 20
    FROM test_runs
    WHERE started_at > DATE_SUB(NOW(), INTERVAL 14 DAY)
)
ORDER BY commits_with_flips DESC;

The subquery in the HAVING clause sets the threshold: a test counts as flaky if it flipped on more than 5% of the commits seen in the last 14 days (COUNT(DISTINCT commit_sha) / 20). Adjust this to your team's tolerance; stricter teams use 1% or 2%.

Step 3: Quarantine, Don’t Retry

When a test crosses the flakiness threshold, automatically move it to a quarantine suite. This suite still runs but does not block the build. Developers see the results but can merge even if quarantined tests fail.

Quarantine creates accountability. The test is still visible, still running, and someone owns fixing it — but it’s no longer blocking the whole team. Contrast with auto-retry, which hides the problem entirely and trains people to ignore failures.

# GitHub Actions workflow with separate stable and quarantined test jobs
jobs:
  stable-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run stable tests
        run: |
          cd api
          # Go's RE2 regexp has no lookahead; use -skip (Go 1.21+) to exclude quarantined tests
          go test ./cmd/server/ -run '^Test' -skip '^TestFlaky' -v -timeout 5m
    # This job's result gates PR merges

  quarantined-tests:
    runs-on: ubuntu-latest
    continue-on-error: true  # does not block merge
    steps:
      - uses: actions/checkout@v4
      - name: Run quarantined tests
        run: |
          cd api
          go test ./cmd/server/ -run '^TestFlaky' -v -timeout 5m
      - name: Notify test owners of failures
        if: failure()
        run: ./scripts/notify-quarantine-owners.sh

Step 4: Fix the Root Cause

Every quarantined test needs an owner and a fix deadline. Common fixes by cause:

Shared state: Add explicit teardown that resets singletons, clears scene state, and releases resources. Better: use dependency injection so tests own their state.

// Good: explicit teardown
[TearDown]
public void ClearState()
{
    GameManager.Reset();
    SceneManager.UnloadAllScenes();
    Resources.UnloadUnusedAssets();
}

// Better: no shared state to begin with
[Test]
public void EnemyTakesDamage()
{
    var gm = new GameManager();  // fresh instance
    var enemy = new Enemy(gm, health: 100);
    enemy.TakeDamage(30);
    Assert.AreEqual(70, enemy.Health);
}

Timing dependencies: Replace wall-clock waits with event-based waits. Instead of yield return new WaitForSeconds(2), wait for a specific signal like yield return new WaitUntil(() => loader.IsComplete).
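The same idea applies outside Unity coroutines: poll a condition against a deadline instead of sleeping a fixed interval. A minimal Python sketch (`wait_until` is a hypothetical helper, not a library function):

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or the deadline passes.

    Event-based waiting: the test proceeds as soon as the condition
    holds, instead of always paying (and sometimes losing the bet on)
    a fixed wall-clock cost.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline
```

The timeout becomes an upper bound for the pathological case rather than a wait every test pays on every run.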

Random generators: Seed them explicitly in test setup. Every test that uses randomness should set the seed to a known value:

[SetUp]
public void SeedRandom()
{
    UnityEngine.Random.InitState(12345);
    // Or if using System.Random
    _rng = new System.Random(12345);
}

Order-dependent tests: Run tests in random order in CI. Most test frameworks support this. If tests only pass in a specific order, you have shared state that needs fixing.
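If your framework can't randomize order natively, a seeded shuffle is simple to bolt on, and logging the seed means a failing order can be replayed exactly. A Python sketch (`shuffled_order` is a hypothetical helper):

```python
import random

def shuffled_order(test_names, seed=None):
    """Return (seed, tests) with the tests in a random but reproducible order.

    If no seed is given, pick one at random; the caller should log it
    so a failing ordering can be rerun exactly with the same seed.
    """
    if seed is None:
        seed = random.randrange(2**32)
    rng = random.Random(seed)  # isolated RNG; doesn't disturb global state
    order = list(test_names)
    rng.shuffle(order)
    return seed, order
```

Print the seed at the start of the run; when a random ordering exposes shared state, rerunning with that seed reproduces the failure deterministically.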

Async / coroutines: Use explicit completion signals instead of frame counts. If a test needs to wait for something to load, wait for an event or callback, not “3 frames.”

Dashboarding Flaky Tests

Build a dashboard that shows the top 20 flakiest tests, their owners, and the trend over time. Review it weekly. Celebrate when a flaky test gets fixed and removed from the list. Publicly tracking this metric creates a cultural norm that flakiness is not acceptable.

Also track the overall CI pass rate on the main branch. If it drops below 99% for reasons other than real regressions, investigate. A single high-volume flaky test can drag the metric down and mask real issues.
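The pass-rate metric itself is a one-liner once runs are exported. A Python sketch, excluding skips so they don't pad the rate (`pass_rate` is our name, not a framework API):

```python
def pass_rate(results):
    """Percentage of non-skipped runs that passed.

    `results` is an iterable of result strings ('pass', 'fail', 'skip',
    'error') from main-branch runs. Skips are excluded from the
    denominator; an empty window reports 100.0 rather than dividing by zero.
    """
    counted = [r for r in results if r != "skip"]
    if not counted:
        return 100.0
    return 100.0 * sum(r == "pass" for r in counted) / len(counted)
```

Computed daily over main-branch runs, this is the number to alarm on when it dips below your 99% floor.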

Related Issues

For broader CI setup, see How to Set Up Automated Build Testing for Games. For regression test strategies, see How to Write Regression Tests for Game Bugs. For smoke tests specifically, check How to Set Up Smoke Tests for Game Builds.

A flaky test is lying to you. Don’t retry — listen, then fix.