Quick answer: A game health dashboard tracks crash-free session rate, frame time percentiles, memory usage, session length, and error rates. Instrument your game client to send telemetry, aggregate it into time-series metrics, and set up alerts for regressions so you find problems before your players do.
After your game launches, you need visibility into how it’s running in the wild. Players don’t file bug reports for every crash or stutter — they just stop playing. A game health dashboard gives you real-time insight into stability, performance, and player experience across every platform and hardware configuration. It’s the difference between reacting to angry Steam reviews and proactively fixing problems before most players encounter them.
Choosing the Right Metrics
Not all metrics are equally useful. Focus on a small set of high-signal metrics rather than tracking everything you can think of. Here are the metrics that matter most for game health:
Crash-free session rate is your single most important stability metric. It’s the percentage of game sessions that complete without an abnormal process termination. A healthy game should maintain a crash-free session rate above 99.5%. Below 99%, players are having a noticeably bad experience and you should treat it as an emergency.
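The computation itself is trivial; the hard part is counting crashed sessions reliably. As a sketch (the function name and inputs are ours, not from any particular SDK):

```cpp
// Crash-free session rate as a percentage. A session counts as crashed
// if it ended in any abnormal process termination.
double CrashFreeSessionRate(long total_sessions, long crashed_sessions) {
    if (total_sessions <= 0) return 100.0;  // no data yet: report healthy
    return 100.0 * (total_sessions - crashed_sessions) / total_sessions;
}
```

For example, 10,000 sessions with 30 crashes is 99.7% (healthy), while 120 crashes puts you at 98.8% — below the 99% emergency line.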
Frame time percentiles tell you about performance better than average FPS. Track p50 (median), p95, and p99 frame times. The p50 tells you the typical experience; the p95 and p99 reveal stutters and hitches that the average hides. For a 60fps target, your p95 frame time should stay below 20ms. If your p99 exceeds 50ms, players are experiencing noticeable hitches.
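A minimal index-based percentile over a batch of frame times looks like this (a sketch for clarity — production code would avoid re-sorting on every query):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simple index-based percentile of a batch of frame times (ms).
// pct is in [0, 1], e.g. 0.95 for p95.
float FrameTimePercentile(std::vector<float> frame_times_ms, double pct) {
    assert(!frame_times_ms.empty() && pct >= 0.0 && pct <= 1.0);
    std::sort(frame_times_ms.begin(), frame_times_ms.end());
    size_t idx = (size_t)(pct * (frame_times_ms.size() - 1));
    return frame_times_ms[idx];
}
```

On a batch like {10, 12, 14, 16, 18, 20, 30, 40, 50, 100}, the median is a comfortable 18ms while the p90 is 50ms — exactly the kind of hitching an average would hide.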
Memory usage should be tracked as peak memory per session and memory at key checkpoints (level loads, inventory open, etc.). Track the p95 peak memory to understand how close you are to the memory ceiling on constrained platforms. Memory leaks show up as a steady increase in memory over session duration.
Session length is a proxy for player engagement and can also indicate technical problems. If average session length suddenly drops, it might mean players are hitting a game-breaking bug at a specific point. Compare session length distributions before and after each update.
Loading times directly impact player experience. Track initial load, level transition loads, and fast-travel loads separately. Break these down by platform and storage type (HDD vs SSD) to understand the range of experiences your players have.
Instrumenting Your Game Client
Good dashboard data starts with good instrumentation. You need to collect telemetry from the game client without impacting performance or player experience. Here’s a practical approach:
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Lightweight telemetry collector. SortByFrameTime, MaxMemory,
// PostToBackend, GetPlatform, and GetGameVersion are app-specific
// helpers you supply.
class HealthTelemetry {
    struct FrameSample {
        float frame_time_ms;
        float memory_mb;
        uint32_t draw_calls;
    };

    std::vector<FrameSample> samples;
    std::string session_id;    // Assigned once at session start
    int sample_interval = 60;  // Sample every 60 frames (once per second at 60fps)
    int frame_counter = 0;

public:
    void OnFrameEnd(float dt_ms, float mem_mb, uint32_t draws) {
        frame_counter++;
        if (frame_counter % sample_interval != 0) return;
        samples.push_back({dt_ms, mem_mb, draws});
        // ~300 samples at one per second: flush to backend every 5 minutes
        if (samples.size() >= 300) {
            FlushAsync(samples);
            samples.clear();
        }
    }

    void FlushAsync(const std::vector<FrameSample>& batch) {
        if (batch.empty()) return;
        // Compute percentiles locally to reduce data volume
        auto sorted_ft = SortByFrameTime(batch);  // ascending by frame_time_ms
        size_t n = batch.size();
        float p50 = sorted_ft[n / 2].frame_time_ms;
        float p95 = sorted_ft[std::min(n - 1, (size_t)(n * 0.95))].frame_time_ms;
        float p99 = sorted_ft[std::min(n - 1, (size_t)(n * 0.99))].frame_time_ms;
        float peak_mem = MaxMemory(batch);
        // Send aggregated data, not raw samples
        PostToBackend({
            {"session_id", session_id},
            {"p50_frame_ms", p50},
            {"p95_frame_ms", p95},
            {"p99_frame_ms", p99},
            {"peak_memory_mb", peak_mem},
            {"sample_count", n},
            {"platform", GetPlatform()},
            {"version", GetGameVersion()}
        });
    }
};
Key principles: sample periodically rather than every frame to minimize overhead, compute aggregates on the client to reduce data volume, batch network requests to avoid impacting gameplay, and always include the game version and platform so you can segment your data.
For crash detection, register an unhandled exception handler that captures the call stack, platform info, and game state at the time of the crash. On mobile and consoles, also detect ANRs (Application Not Responding) and out-of-memory kills, which the OS may terminate without triggering your crash handler.
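Stack capture in an exception handler is platform-specific, but the fallback for OS-level kills is portable: a session marker file. If the marker from the previous run still exists at startup, that session never shut down cleanly. A minimal sketch (the path is a hypothetical choice):

```cpp
#include <cstdio>
#include <filesystem>

// Hypothetical marker location; in practice, put it in your save/data dir.
static const char* kMarkerPath = "session_in_progress.marker";

// A leftover marker means the last session ended abnormally:
// a crash, an OOM kill, or a force-quit your handler never saw.
bool PreviousSessionCrashed() {
    return std::filesystem::exists(kMarkerPath);
}

void OnSessionStart() {
    if (std::FILE* f = std::fopen(kMarkerPath, "w")) std::fclose(f);
}

void OnCleanShutdown() {
    std::filesystem::remove(kMarkerPath);
}
```

Check `PreviousSessionCrashed()` at launch and report it as a crashed session before writing the new marker.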
Building the Metrics Pipeline
Raw telemetry data needs to be aggregated into queryable metrics before it’s useful in a dashboard. The typical pipeline looks like this: game clients send telemetry events to an ingestion endpoint, events are written to a time-series store, a background job computes per-minute and per-hour aggregates, and the dashboard queries the aggregated data.
For an indie studio, you don’t need a complex data pipeline. A simple approach is to have your API server receive telemetry events, write them to a PostgreSQL table partitioned by date, and run hourly aggregation queries:
-- Hourly performance aggregation (PostgreSQL; PERCENTILE_CONT ... WITHIN
-- GROUP is an ordered-set aggregate, not available in MySQL)
INSERT INTO performance_hourly
    (game_id, hour, platform, version,
     p50_frame_ms, p95_frame_ms, p99_frame_ms,
     p95_memory_mb, session_count)
SELECT
    game_id,
    date_trunc('hour', created_at) AS hour,
    platform,
    game_version,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY p50_frame_ms) AS p50_frame_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY p95_frame_ms) AS p95_frame_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY p99_frame_ms) AS p99_frame_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY peak_memory_mb) AS p95_memory_mb,
    COUNT(DISTINCT session_id) AS session_count
FROM telemetry_raw
WHERE created_at >= date_trunc('hour', NOW() - INTERVAL '1 hour')
  AND created_at <  date_trunc('hour', NOW())
GROUP BY game_id, date_trunc('hour', created_at), platform, game_version;
Retain raw data for 7–14 days for debugging, and keep hourly aggregates for 90 days or longer for trend analysis. This keeps storage costs manageable while giving you enough history to spot gradual regressions.
Dashboard Design and Layout
Your dashboard should answer the question “is my game healthy right now?” within 5 seconds of looking at it. Put the most critical metrics at the top: crash-free session rate as a large number with a sparkline showing the last 7 days, followed by p95 frame time with a trend indicator.
Below the headline metrics, show time-series graphs for each key metric. Include version release markers on the timeline so you can visually correlate metric changes with deployments. Add platform and version filters so you can drill down into specific segments.
A well-designed game health dashboard has these sections: a top-line status panel (green/yellow/red based on key metric thresholds), a stability section (crash-free rate, top crash signatures, crash trend), a performance section (frame time percentiles, memory usage, loading times), and a sessions section (active users, session length distribution, retention).
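The top-line status panel reduces to a threshold check. A sketch of one possible mapping — the cutoffs echo the stability and performance guidance above, but should be tuned to your game’s own baseline:

```cpp
#include <string>

// Top-line dashboard status from two key metrics. Thresholds here
// (99%/99.5% crash-free, 20ms/50ms p95 frame time) are starting
// points, not universal constants.
std::string HealthStatus(double crash_free_pct, double p95_frame_ms) {
    if (crash_free_pct < 99.0 || p95_frame_ms > 50.0) return "red";
    if (crash_free_pct < 99.5 || p95_frame_ms > 20.0) return "yellow";
    return "green";
}
```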
Alerting That Works
Dashboards are useless if nobody is looking at them when things go wrong. Set up automated alerts that notify your team when metrics cross critical thresholds. Start with these alerts and tune the thresholds based on your game’s baseline:
- Crash rate spike: Alert when crash-free session rate drops below 99% in any 1-hour window. This catches sudden regressions from bad updates.
- New crash signature: Alert when a previously unseen crash signature appears and affects more than 10 users. This catches new bugs introduced by code changes.
- Performance regression: Alert when p95 frame time increases by more than 50% compared to the same hour yesterday. This catches performance regressions that might not cause crashes but degrade the experience.
- Session length drop: Alert when average session length drops by more than 30% compared to the previous 7-day average. This is a lagging indicator but can reveal game-breaking bugs that don’t cause crashes.
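Each of these alerts is a comparison against a baseline. The performance-regression check, for instance, can be sketched as (function name and shape are ours):

```cpp
// Performance-regression alert: fire when the current p95 frame time
// exceeds the same hour yesterday by more than 50%, matching the
// starting threshold suggested above.
bool P95RegressionAlert(double p95_now_ms, double p95_yesterday_ms) {
    if (p95_yesterday_ms <= 0.0) return false;  // no baseline yet: don't fire
    return p95_now_ms > 1.5 * p95_yesterday_ms;
}
```

Comparing against the same hour yesterday, rather than the previous hour, avoids false alarms from normal daily load patterns.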
Send alerts to wherever your team actually looks — Discord, Slack, or a dedicated on-call channel. Email alerts are too slow for game health issues. Include enough context in the alert to start investigating immediately: the metric value, the threshold it crossed, the affected platform and version, and a link to the relevant dashboard view.
Avoid alert fatigue by starting with conservative thresholds and adjusting them as you learn your game’s baseline. If an alert fires and the team decides it wasn’t worth waking up for, loosen the threshold. An alert that fires too often gets ignored, which is worse than having no alert at all.
You can’t fix what you can’t see. Instrument first, optimize second.