Quick answer: Track the pass/fail result of every test across CI runs in a lightweight database. When a test both passes and fails within its last ten runs on the same branch, flag it as flaky and move it to a non-blocking CI job. File a bug automatically, assign an owner, and set a two-week deadline. The quarantined test keeps running — you just stop letting it block merges until the root cause is fixed.
Flaky tests are the boy-who-cried-wolf of CI. After the third time a test fails for no reason and passes on retry, your team starts ignoring failures entirely — including the real ones. A quarantine system formalizes the distinction between “this test is unreliable” and “this test is catching a real bug,” so your pipeline stays trustworthy while you fix the flakiness on your own schedule.
Why Flaky Tests Are Dangerous in Game Dev
Game test suites are uniquely prone to flakiness. Physics simulations produce slightly different results across frames. GPU-dependent rendering tests break on different driver versions. Integration tests against live services time out when a backend is slow. Network tests race against connection establishment. Every one of these failure modes is intermittent, and every one of them will erode your team’s trust in CI if you don’t handle them explicitly.
The cost is not just wasted time re-running pipelines. It is missed regressions. When developers learn to click “Retry” without reading the failure, they are training themselves to ignore signals. A quarantine system makes the flaky tests visible and accountable without letting them act as a veto on every merge.
Step 1: Record Every Test Result
Your CI runner already knows which tests passed and which failed. The missing piece is persisting that data across runs. After each test suite execution, write a result record for every test to a shared store — a Postgres table, a SQLite file in your artifact bucket, or even a JSON file committed to a metadata branch.
```python
# record_results.py — post-test CI step
# Requires: pip install junitparser
import datetime
import json
import os
import sys

from junitparser import JUnitXml, TestSuite

def record(junit_xml_path, db_path):
    xml = JUnitXml.fromfile(junit_xml_path)
    branch = os.environ.get("CI_BRANCH", "unknown")
    commit = os.environ.get("CI_COMMIT", "unknown")
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    results = []
    # A JUnit file may have a <testsuites> or a single <testsuite> root;
    # normalize to a flat list of suites either way.
    suites = [xml] if isinstance(xml, TestSuite) else xml
    for suite in suites:
        for case in suite:
            results.append({
                "test": f"{case.classname}.{case.name}",
                # junitparser stores failures/errors/skips in `result`;
                # an empty result means the test passed.
                "passed": not case.result,
                "duration": case.time,
                "branch": branch,
                "commit": commit,
                "timestamp": ts,
            })
    # Append to a JSON-lines file
    with open(db_path, "a") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

if __name__ == "__main__":
    record(sys.argv[1], sys.argv[2])
```
The format doesn’t matter much at this scale. What matters is that you have a time-ordered history of every test’s pass/fail status, keyed by branch. You need the branch because a test that fails on every run of a feature branch is not flaky — it is broken by that branch’s changes.
Step 2: Detect Flakiness Automatically
After recording results, run a detection pass. The simplest effective heuristic: look at the last ten runs of each test on the current branch. If the test has at least one pass and at least two failures in that window, it is flaky. Tests that fail every run are broken; tests that fail once are noise.
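That heuristic can be sketched as a pass over the JSON-lines history described in Step 1. This is a minimal illustration, not a definitive implementation — the file name and field names are the ones assumed above, and the thresholds are the ones from this section:

```python
# detect_flaky.py — sketch of the detection pass over the JSON-lines
# history (field names as in the recording step above).
import json
from collections import defaultdict

WINDOW = 10       # look at the last ten runs per test
MIN_FAILURES = 2  # one failure is noise; two or more is a pattern

def find_flaky(db_path, branch):
    history = defaultdict(list)  # test name -> pass/fail, oldest first
    with open(db_path) as f:
        for line in f:
            r = json.loads(line)
            if r["branch"] == branch:
                history[r["test"]].append(r["passed"])
    flaky = []
    for test, results in history.items():
        window = results[-WINDOW:]
        passes = sum(window)
        failures = len(window) - passes
        # At least one pass AND at least two failures: flaky.
        # All failures: broken, not flaky. One failure: noise.
        if passes >= 1 and failures >= MIN_FAILURES:
            flaky.append(test)
    return sorted(flaky)
```

Note that a test failing in all ten runs never enters the flaky list — it has no passes, so it stays a blocking failure, which is exactly what you want for a genuinely broken test.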
Store the quarantine list in a configuration file that your test runner reads. When a test is on the list, the runner still executes it but marks the result as advisory rather than blocking. In most CI systems this means moving the test to a separate job whose failure does not gate the merge check.
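The selection logic the runner needs is small. Here is a sketch under stated assumptions — a `quarantine.json` file holding a list of test IDs and a flag distinguishing the blocking job from the advisory one are both inventions for illustration:

```python
# quarantine_filter.py — sketch of the runner's selection logic.
# The quarantine.json file name and its format are assumptions.
import json

def load_quarantine(path="quarantine.json"):
    """Read the quarantine list; an absent file means nothing is quarantined."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def select_tests(all_tests, quarantined, quarantine_job):
    """Blocking job runs everything NOT quarantined; the advisory
    quarantine job runs ONLY the quarantined tests."""
    if quarantine_job:
        return [t for t in all_tests if t in quarantined]
    return [t for t in all_tests if t not in quarantined]
```

In a pytest suite, for example, this logic could live in a `pytest_collection_modifyitems` hook that deselects one group or the other based on an environment variable.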
Automatically file a bug in your tracker for each newly quarantined test. Include the test name, the failure rate, the last three failure logs, and a link to the CI runs. Assign it to the team that owns the test file. Set a two-week SLA — if the test is still quarantined after two weeks, escalate it.
Step 3: Isolate Without Ignoring
The quarantine job runs the same test binary with a filter flag that selects only quarantined tests. It publishes results to the same history database. This is critical: you need continued data to know whether the flakiness has been fixed. A test exits quarantine when it passes ten consecutive runs without a failure.
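The exit condition is easy to check against the same history. A minimal sketch, assuming the JSON-lines format from Step 1:

```python
# release_check.py — sketch: a test leaves quarantine after ten
# consecutive passing runs in the shared history.
import json

STREAK = 10  # consecutive passes required to exit quarantine

def ready_to_release(db_path, test, branch):
    runs = []  # pass/fail for this test on this branch, oldest first
    with open(db_path) as f:
        for line in f:
            r = json.loads(line)
            if r["test"] == test and r["branch"] == branch:
                runs.append(r["passed"])
    last = runs[-STREAK:]
    # Require a full window of runs, all passing.
    return len(last) == STREAK and all(last)
```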
“Quarantine is not deletion. The moment you stop running a flaky test, you lose the data you need to fix it and the coverage it was providing when it did pass.”
Configure your CI pipeline with two parallel jobs: test-blocking runs all non-quarantined tests and gates the merge, test-quarantine runs quarantined tests and reports results without gating. Both jobs publish JUnit XML artifacts that feed the history database.
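The shape of that split in GitHub Actions syntax might look like the following — a sketch only, to be adapted to your CI system; the job names match this article, but the `RUN_QUARANTINED` flag and file names are assumptions:

```yaml
jobs:
  test-blocking:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest --junitxml=results-blocking.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: results-blocking
          path: results-blocking.xml

  test-quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true  # advisory: failure does not gate the merge
    steps:
      - uses: actions/checkout@v4
      - run: RUN_QUARANTINED=1 pytest --junitxml=results-quarantine.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: results-quarantine
          path: results-quarantine.xml
```

The `continue-on-error: true` line is what makes the quarantine job advisory while still publishing its artifacts.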
Common Causes of Flakiness in Game Tests
Once you have the quarantine list, patterns emerge. The most common causes we see in game studios are: timing dependencies (a test assumes an animation completes in exactly N frames but frame timing varies), shared state (one test modifies a singleton that another test reads), file I/O races (save/load tests that don’t use unique temp directories), and network timeouts (integration tests against live backends). Fix these categories in that priority order — timing issues are the most common and usually the cheapest to resolve.
For timing issues, replace frame-count waits with condition-based polling: “wait until the animation state machine reports idle” instead of “wait 30 frames.” For shared state, reset singletons in a test fixture teardown. For file I/O, generate a unique temp path per test using the test name and a UUID. For network timeouts, mock the backend or increase the timeout and accept the slower suite.
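The condition-based polling pattern is worth spelling out, since it replaces the single most common source of flakiness. A minimal sketch — the animation-state accessor in the usage comment is a stand-in for whatever your engine exposes:

```python
# wait_for.py — condition-based polling instead of fixed frame counts.
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout, so the
    caller can assert on the result with a meaningful failure.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Usage in a test, instead of sleeping for "30 frames":
#   assert wait_for(lambda: anim_state_machine.state == "idle")
```

The timeout still exists, but it is now a generous upper bound rather than an exact prediction, so the test passes as soon as the condition holds and only fails when something is genuinely wrong.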
Metrics and Accountability
Track three metrics weekly: the number of tests in quarantine, the average age of quarantined tests, and the total flaky failure rate across all tests. The first tells you whether flakiness is growing or shrinking. The second tells you whether the team is actually fixing quarantined tests. The third is your overall test health signal.
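All three metrics fall out of data you already have. A sketch, assuming the quarantine list stores a `quarantined_at` date per entry (an addition not shown in the earlier steps) alongside the JSON-lines result history:

```python
# weekly_metrics.py — sketch of the three weekly health metrics.
# Assumes quarantine entries of {"test": ..., "quarantined_at": "YYYY-MM-DD"}.
import datetime

def weekly_metrics(quarantine, results, today):
    # 1. Size of the quarantine list: is flakiness growing or shrinking?
    count = len(quarantine)
    # 2. Average age in days: is the team actually fixing quarantined tests?
    ages = [(today - datetime.date.fromisoformat(q["quarantined_at"])).days
            for q in quarantine]
    avg_age_days = sum(ages) / count if count else 0.0
    # 3. Overall failure rate across all recorded runs.
    failures = sum(1 for r in results if not r["passed"])
    failure_rate = failures / len(results) if results else 0.0
    return {"quarantined": count,
            "avg_age_days": avg_age_days,
            "failure_rate": failure_rate}
```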
Post these numbers in your team’s weekly standup or Slack channel. Make the quarantine list visible. Developers who see their test in quarantine for three weeks running are motivated to fix it in a way that a CI retry button never motivates.
Related Issues
If your flaky tests are related to release pipelines, see How to Build a Pre-Release Checklist Generator. For structured logging that helps diagnose intermittent test failures in multiplayer scenarios, check How to Set Up Structured Logging for Multiplayer Games.
A quarantine list with zero tests on it is the goal. The system exists so you can get there without breaking your pipeline along the way.