Can you run meaningful performance tests on shared CI runners?

No. Shared runners have too much variance from neighboring workloads. Run performance tests on dedicated self-hosted runners with fixed hardware, or use cloud instances reserved for benchmarks only. Even then, track the 99th percentile frame time rather than the average, because the percentile captures the spikes players feel while the average hides them.

What metric should trigger a performance regression failure?

Compare the 99th percentile frame time against a baseline from the previous release. Fail the build if the new number is more than 10 percent worse. Also track mean frame time, GPU time, and memory peak, but treat only the p99 frame time as blocking.

How long should a performance benchmark scene run?

At least 30 seconds of real gameplay after a 5-second warm-up to let caches and JIT stabilize. Repeat three runs and take the best to filter out OS noise. Anything shorter produces unreliable percentiles.

How to Test Game Performance Regression in CI

Quick answer: Build a scripted benchmark scene, run it on dedicated CI hardware with fixed CPU frequency, capture per-frame times, compute the 99th percentile, and fail the build when the p99 frame time is more than 10 percent worse than the previous baseline. Do not trust shared CI runners for timing data.

Every game team has lived through the same story: a feature lands that looks fine locally, but two weeks later a player reports frame drops that the team cannot reproduce. The fix turns out to be trivial, but finding it costs weeks of debugging, and by then three more regressions have piled on top. Continuous performance testing exists to catch these bugs at the moment the code that caused them is reviewed. Done right, it prevents the slow, compounding decay in which your game ends up 30 percent slower than it was six months ago without anyone noticing along the way.

Build a Benchmark Scene

You need a scene that exercises real gameplay systems under a reproducible camera path. A static empty level tells you nothing. A full gameplay slice is too variable. What works is a scripted scenario: spawn a fixed set of enemies and particles, move the camera along a predetermined spline, trigger a scripted combat sequence, and let it run for 30 seconds.

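using System.Collections;
using System.Collections.Generic;
using System.IO;
using UnityEngine;

public class PerfBenchmarkScene : MonoBehaviour {
    public Transform[] cameraWaypoints;
    public float durationSeconds = 30f;

    // Serializable wrapper so JsonUtility can write the samples as JSON.
    [System.Serializable]
    public class Result {
        public List<float> frameTimesMs;
        public Result(List<float> samples) { frameTimesMs = samples; }
    }

    readonly List<float> frameTimes = new();

    IEnumerator Start() {
        // 5 second warm-up; nothing is recorded during this window
        yield return new WaitForSeconds(5f);

        float t0 = Time.realtimeSinceStartup;
        while (Time.realtimeSinceStartup - t0 < durationSeconds) {
            // Advance one frame first so the first sample is a frame
            // that began after the warm-up ended.
            yield return null;
            frameTimes.Add(Time.unscaledDeltaTime * 1000f);
        }

        File.WriteAllText("perf-result.json",
            JsonUtility.ToJson(new Result(frameTimes)));
        Application.Quit();
    }
}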

Two things are load-bearing here. First, the warm-up period. JIT compilation, texture streaming, and GPU pipeline state take a few seconds to stabilize. Any timing captured in the first 3–5 seconds is noise. Second, determinism. The same inputs must produce the same work. If your benchmark scene uses random seeds, AI nondeterminism, or network traffic, your percentiles will wander from run to run.

Run on Dedicated Hardware

Shared CI runners are a disaster for performance data. GitHub-hosted runners, GitLab's default shared runners, and most other hosted CI services run jobs on virtual machines that share a physical host, where a noisy neighbor can double your frame times without warning. The variance will swamp your signal.

The fix is a self-hosted runner on dedicated hardware. A single mid-range PC or Mac mini is enough for most indie teams. Pin the CPU frequency to a fixed value in the BIOS (disable turbo boost, disable power-saving C-states), pin the GPU clocks if possible, and keep the machine dedicated to CI. Never use it as a developer workstation.

# Pin the CPU governor to "performance" on a Linux runner
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu" > /dev/null
done

# Disable turbo to reduce variance
# (this path exists only when the intel_pstate driver is active)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo > /dev/null

For teams that cannot justify dedicated hardware, cloud instances that are not shared with other tenants (AWS bare-metal instances, GCP sole-tenant nodes) are acceptable if you reserve them for benchmark runs and schedule them sparsely enough that the provider does not migrate you mid-test.

Use Percentiles, Not Averages

The single most important lesson in game performance measurement: averages lie. A game that renders most frames in 16.6ms but occasionally spikes to 50ms feels terrible, yet its average is only a little worse than 16.6ms. Spikes are what players feel, and percentiles capture spikes while averages hide them.
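To make that concrete, here is a small sketch with made-up numbers: a hypothetical 1000-frame capture in which 2 percent of frames spike to 50ms. The capture is invented for illustration, and the percentile uses a simple index into the sorted list.

```python
import statistics

# Hypothetical capture: 980 smooth frames at 16.6 ms plus 20 spikes at 50 ms.
frame_times_ms = [16.6] * 980 + [50.0] * 20

mean_ms = statistics.fmean(frame_times_ms)

# Simple percentile: the value at index int(n * 0.99) of the sorted samples.
sorted_ft = sorted(frame_times_ms)
p99_ms = sorted_ft[int(len(sorted_ft) * 0.99)]

print(f"mean: {mean_ms:.2f} ms")
print(f"p99:  {p99_ms:.2f} ms")
```

The mean works out to roughly 17.3ms, which looks almost healthy, while the p99 lands squarely on the 50ms spikes.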

Track three numbers: mean frame time, 95th percentile, and 99th percentile. The 99th percentile is the most sensitive to regressions. Alert on p99 changes of 10 percent or more. Also track the count of frames above 33ms (the “hitch count”), because a frame that takes longer than about 33ms means the game has momentarily dropped below 30 FPS, and players notice.

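def compute_stats(frame_times_ms):
    """Summarize per-frame times (in milliseconds) from one benchmark run."""
    if not frame_times_ms:
        raise ValueError("no frame times captured")
    sorted_ft = sorted(frame_times_ms)
    n = len(sorted_ft)
    return {
        "mean": sum(sorted_ft) / n,
        # int(n * q) stays below n for q < 1, so the index is always valid
        "p95": sorted_ft[int(n * 0.95)],
        "p99": sorted_ft[int(n * 0.99)],
        # hitch count: frames slower than 33 ms (roughly 30 FPS or below)
        "hitches": sum(1 for f in sorted_ft if f > 33),
    }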

Compare Against a Baseline

A single run has no meaning without a reference point. Store a rolling baseline: the previous release’s results, or the last 10 main-branch runs averaged. Commit these baselines to the repo under ci/baselines/ so they are versioned alongside code. When a PR runs, compare its numbers to the baseline and fail the build on regression.
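One way to build the rolling variant is sketched below. It assumes each run is a dict of numeric metrics and that baselines live in one JSON file per scene. The function names and the file naming scheme are illustrative assumptions, not an established convention.

```python
import json
from pathlib import Path
from statistics import fmean


def rolling_baseline(runs, window=10):
    """Average each metric over the most recent `window` runs.

    `runs` is a list of stats dicts ordered oldest to newest. All dicts
    are assumed to share the same numeric keys.
    """
    recent = runs[-window:]
    if not recent:
        raise ValueError("need at least one run to build a baseline")
    return {metric: fmean(r[metric] for r in recent) for metric in recent[0]}


def write_baseline(baseline, scene, directory="ci/baselines"):
    """Write the baseline as JSON so it can be committed with the code.

    The one-file-per-scene layout is a hypothetical choice.
    """
    path = Path(directory) / f"{scene}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(baseline, indent=2, sort_keys=True) + "\n")
    return path
```

Sorting the keys and ending the file with a newline keeps diffs small when the baseline is updated.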

Be strict on direction: treat any regression > 10 percent as a blocker. For improvements, log them and update the baseline automatically after the PR merges. This keeps the baseline ratcheting down over time instead of drifting up.
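A minimal sketch of that rule, assuming the baseline and current p99 values have already been loaded as numbers in milliseconds. The function name and the three verdict strings are illustrative choices.

```python
REGRESSION_THRESHOLD = 0.10  # "more than 10 percent worse" blocks the build


def check_p99(baseline_p99_ms, current_p99_ms, threshold=REGRESSION_THRESHOLD):
    """Classify a p99 frame time change relative to the baseline.

    Returns (verdict, relative_change), where verdict is:
      "regression"  - slower by more than `threshold`; fail the build
      "improvement" - faster than baseline; update the baseline after merge
      "ok"          - same or slightly slower, within the threshold
    """
    if baseline_p99_ms <= 0:
        raise ValueError("baseline p99 must be positive")
    change = (current_p99_ms - baseline_p99_ms) / baseline_p99_ms
    if change > threshold:
        return "regression", change
    if change < 0:
        return "improvement", change
    return "ok", change
```

In a CI script, a "regression" verdict would map to a non-zero exit code, and an "improvement" verdict would be logged so the baseline can be refreshed once the PR merges.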

Report Results on the PR

A CI failure with no explanation is a frustration. A CI comment that says “p99 frame time regressed 18ms -> 24ms (33% slower) on benchmark scene Forest” tells the author exactly what to investigate. Post the full stats table as a PR comment, and link to the raw frame-time JSON for deeper analysis.
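The comment text itself is easy to generate. The sketch below assumes two stats dicts with the keys mean, p95, p99, and hitches, and only builds the string. Posting it depends on your CI provider's API and is left out. The function name is hypothetical.

```python
def format_regression_comment(scene, baseline, current):
    """Build a plain-text PR comment from two stats dicts (times in ms)."""
    b99, c99 = baseline["p99"], current["p99"]
    pct = (c99 - b99) / b99 * 100
    direction = "slower" if pct > 0 else "faster"
    lines = [
        f"p99 frame time on benchmark scene {scene}: "
        f"{b99:g}ms -> {c99:g}ms ({abs(pct):.0f}% {direction})",
        "",
        f"{'metric':<8} {'baseline':>10} {'current':>10}",
    ]
    # One row per tracked metric so the author sees the whole picture.
    for key in ("mean", "p95", "p99", "hitches"):
        lines.append(f"{key:<8} {baseline[key]:>10g} {current[key]:>10g}")
    return "\n".join(lines)
```

The headline goes first so it still reads well when the CI system truncates long comments.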

Run Multiple Scenes

A single benchmark scene can’t tell you which subsystem regressed. Keep a small battery of scenes, each stressing a different bottleneck: dense particle effects, large crowd AI, physics-heavy destruction, heavy shader work. When one regresses and others don’t, you know where to look before you even start profiling.
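As a rough sketch, the per-scene comparison can be reduced to a function that returns only the scenes whose p99 got worse. It assumes baselines and results are dicts keyed by scene name, and the scene names in the usage example are invented.

```python
def regressed_scenes(baselines, results, threshold=0.10):
    """Return {scene: relative_p99_change} for scenes over the threshold."""
    flagged = {}
    for scene, current in results.items():
        base = baselines.get(scene)
        if base is None:
            continue  # new scene without a baseline yet; nothing to compare
        change = (current["p99"] - base["p99"]) / base["p99"]
        if change > threshold:
            flagged[scene] = change
    return flagged


# Hypothetical battery: only the crowd AI scene got noticeably slower.
baselines = {"particles": {"p99": 20.0}, "crowd_ai": {"p99": 18.0}, "physics": {"p99": 22.0}}
results = {"particles": {"p99": 20.5}, "crowd_ai": {"p99": 24.0}, "physics": {"p99": 21.0}}
print(regressed_scenes(baselines, results))
```

Here only crowd_ai is flagged, which points at AI or animation work rather than rendering before any profiler is opened.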

“We added perf CI six months ago. In that time we caught fourteen regressions before merge, including one that would have tanked frame rates on half our target hardware. The setup cost was one week of engineering and one dedicated Mac mini sitting in the closet.”

Related Issues

For broader CI setup, see how to measure code coverage in game projects. To trace regressions post-ship, read how to track and reduce crash rate over releases.

Run one benchmark scene on a dedicated runner this week. Even a single scene comparing p99 frame time to the previous release will catch regressions you would otherwise miss for weeks.