Should I retry every failed network request?

No. Only retry idempotent operations that hit transient failures. A 500 or a connection reset is retryable. A 400 is a bug on your side that will fail again, and a 401 or 403 needs a token refresh or a permissions fix, not a blind retry. A POST that charged a player once must not be silently retried without an idempotency key.

What is jitter and why does it matter?

Jitter is random variation added to retry delays. Without it, every client that failed at the same time retries at the same time, creating thundering herd spikes that take the service down again. With jitter, retries spread out across time and the service recovers.

When should the client give up and show an error?

After a budgeted number of attempts or a time cap. A cloud save that has retried for 60 seconds should stop and tell the player. Silent infinite retries drain batteries, fill logs, and hide real outages from your monitoring.

How to Design a Retry Strategy for Flaky Network Ops

Quick answer: Classify errors as retryable or permanent, retry only the former with exponential backoff plus full jitter, cap total retry time, and use idempotency keys for anything that mutates state. Add a circuit breaker to stop hammering a service that is visibly broken, and always surface a clear message to the player when retries are exhausted.

Network operations in a game fail constantly. A phone switches from WiFi to LTE, a home router reboots, a CDN node fails a health check, a mobile carrier throttles a connection. These are transient problems and most of them resolve in seconds. A good retry strategy turns them into invisible hiccups. A bad one turns them into lost save data, double-charged microtransactions, or a UI that spins forever while the player wonders if they need to restart the game.

Classify Errors First

The worst retry strategy is “retry everything that failed.” Most protocol errors are not transient. A 400 Bad Request means your client sent something wrong; retrying will fail again and wastes battery. A 401 (and, on some backends, a 403) usually means the player’s token is missing or expired; retrying without refreshing the token is pointless. Build an error classifier once, at the HTTP client layer, so every caller sees the same decision.

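using System;
using System.Net.Sockets;

enum ErrorClass { Retryable, Permanent, AuthRefresh }

// status is the HTTP status code, or 0 when no response was received.
static ErrorClass Classify(int status, Exception ex) {
    // Transport-level failures: no response arrived, so a retry may succeed.
    if (ex is TimeoutException or SocketException) return ErrorClass.Retryable;
    if (status is 401 or 403) return ErrorClass.AuthRefresh;     // refresh the token first
    if (status is 408 or 429) return ErrorClass.Retryable;       // request timeout, rate limited
    if (status is >= 500 and < 600) return ErrorClass.Retryable; // server-side failure
    return ErrorClass.Permanent; // remaining 4xx; 2xx/3xx should not reach here
}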

Treat 429 Too Many Requests specially. If the server sent a Retry-After header, wait at least that long before the next attempt instead of using your own backoff delay. Retrying sooner than the server asked invites further rate limiting and can get your IP blocklisted by the platform operator.
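As a minimal sketch of reading that header with .NET's HttpClient types: the header can carry either a number of seconds or an HTTP date, and both forms are handled here. The class and method names (RetryAfterHelper, GetServerDelay) are hypothetical, and the current time is passed in as a parameter only to keep the date branch easy to test.

```csharp
using System;
using System.Net.Http;

static class RetryAfterHelper
{
    // Returns how long the server asked us to wait, or null if it did not say.
    public static TimeSpan? GetServerDelay(HttpResponseMessage response, DateTimeOffset now)
    {
        var retryAfter = response.Headers.RetryAfter;
        if (retryAfter is null) return null;

        // Form 1: "Retry-After: 120" (a delay in seconds)
        if (retryAfter.Delta is TimeSpan delta) return delta;

        // Form 2: "Retry-After: <HTTP date>"; clamp to zero if the date has already passed
        if (retryAfter.Date is DateTimeOffset date)
        {
            var wait = date - now;
            return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
        }
        return null;
    }
}
```

A retry loop would then wait for the larger of this value and its own jittered delay, and still give up if that wait would exceed the total retry budget.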

Exponential Backoff with Full Jitter

Fixed-interval retries cause thundering herds. When a service comes back up after a short outage, every client that failed retries at the same moment, spiking load and crashing the service again. Exponential backoff spreads retries across time, and jitter smears them further.

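// Requires: using System; using System.Net.Http; using System.Threading.Tasks;
// Assumes RetryOptions exposes Base, MaxBackoff, and MaxTotal as TimeSpan values.
async Task<T> RetryAsync<T>(Func<Task<T>> op, RetryOptions opt) {
    var rng = new Random();
    var deadline = DateTime.UtcNow + opt.MaxTotal;
    int attempt = 0;
    while (true) {
        try { return await op(); }
        catch (Exception ex) when (Classify(StatusOf(ex), ex) == ErrorClass.Retryable) {
            attempt++;
            // Full jitter: uniform delay in [0, min(MaxBackoff, Base * 2^attempt))
            var capMs = Math.Min(opt.MaxBackoff.TotalMilliseconds,
                                 opt.Base.TotalMilliseconds * Math.Pow(2, attempt));
            var delay = TimeSpan.FromMilliseconds(rng.NextDouble() * capMs);
            if (DateTime.UtcNow + delay > deadline) throw; // budget exhausted: surface the last error
            await Task.Delay(delay);
        }
    }
}

// HTTP status carried by the exception, or 0 if there is none (timeout, socket error).
// Assumes .NET 5 or later, where HttpRequestException exposes StatusCode.
static int StatusOf(Exception ex) =>
    ex is HttpRequestException { StatusCode: { } code } ? (int)code : 0;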

Full jitter (a uniform random delay between 0 and the cap) holds up well against other jitter schemes in practice. AWS published a comparison of the common variants years ago, and full jitter remains a common default in client libraries. Do not reinvent it.
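To make the numbers concrete, here is a small sketch of the delay window on its own, using the same formula as the retry loop (the cap is the smaller of the maximum backoff and the base times 2 to the power of the attempt number). The Backoff class and the 200 ms / 10 s values are illustrative assumptions, not recommendations.

```csharp
using System;

static class Backoff
{
    // Upper bound of the delay window for a 1-based attempt number, in milliseconds.
    public static double CapMs(double baseMs, double maxBackoffMs, int attempt) =>
        Math.Min(maxBackoffMs, baseMs * Math.Pow(2, attempt));

    // Full jitter: a uniform random delay in [0, cap).
    public static TimeSpan NextDelay(Random rng, double baseMs, double maxBackoffMs, int attempt) =>
        TimeSpan.FromMilliseconds(rng.NextDouble() * CapMs(baseMs, maxBackoffMs, attempt));
}
```

With a 200 ms base and a 10 s maximum, the caps for attempts 1 through 6 are 400, 800, 1600, 3200, 6400, and 10000 ms. Because the delay is uniform over that window, the average wait is half the cap, which is the price paid for spreading clients out.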

Idempotency Keys for Mutations

A retry on a read is safe. A retry on a write can cause duplicate purchases, duplicate leaderboard submissions, or double-applied patch notes. For every non-idempotent operation, generate a client-side UUID and send it as a request header. The server deduplicates on the key so a retried request produces the same effect as the first request.

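// Client: generate the key once, outside the retried lambda
var idempotencyKey = Guid.NewGuid().ToString();
await RetryAsync(async () => {
    // A sent HttpRequestMessage cannot be reused, so build a fresh one per attempt.
    // Assumes payload is a JSON string and retryOptions is a RetryOptions instance.
    using var req = new HttpRequestMessage(HttpMethod.Post, "/purchase") {
        Content = new StringContent(payload, Encoding.UTF8, "application/json")
    };
    req.Headers.Add("Idempotency-Key", idempotencyKey);
    var resp = await http.SendAsync(req);
    resp.EnsureSuccessStatusCode(); // throws HttpRequestException so RetryAsync can classify it
    return resp;
}, retryOptions);

// Server: the lookup and the store must be atomic in production (for example,
// a unique constraint on the key) so two concurrent attempts cannot both process.
if (db.TryGetResult(idempotencyKey, out var prior)) return prior;
var result = ProcessPurchase(...);
db.StoreResult(idempotencyKey, result, ttl: TimeSpan.FromHours(24));
return result;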

Key generation must happen once per logical request, not once per retry attempt. Pass the same key through all attempts. If you regenerate the key on retry, you lose deduplication and the whole scheme collapses.

Circuit Breakers for Real Outages

When a service is down for real, retries do not help. They drain battery on the client and add load to servers that are already struggling. A circuit breaker tracks the recent failure rate and, when it crosses a threshold, stops allowing requests entirely for a cooldown period. After the cooldown it allows a single probe; if the probe succeeds, the circuit closes again, and if it fails, the cooldown starts over.

Keep per-endpoint circuits. A dead leaderboard endpoint should not shut down matchmaking. And never let circuits affect critical paths like crash reporting or auth refresh — those need to keep trying even during an outage so recovery is smooth.
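The open, cooldown, and single-probe cycle described above can be sketched as follows. This is a simplified illustration, not a production breaker: it trips on consecutive failures rather than a true failure rate, and the type names, the threshold of 5, and the 30-second cooldown are all assumptions. The clock is passed in so the behavior can be tested without waiting.

```csharp
using System;
using System.Collections.Concurrent;

// Simplified breaker: trips on consecutive failures rather than a failure rate.
sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _cooldown;
    private readonly object _gate = new object();
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;   // null while the circuit is closed
    private bool _probeInFlight;

    public CircuitBreaker(int failureThreshold, TimeSpan cooldown)
    {
        _failureThreshold = failureThreshold;
        _cooldown = cooldown;
    }

    // Returns true if the caller may send a request now.
    public bool TryAcquire(DateTimeOffset now)
    {
        lock (_gate)
        {
            if (_openedAt is null) return true;                  // closed: allow
            if (now - _openedAt.Value < _cooldown) return false; // open: still cooling down
            if (_probeInFlight) return false;                    // only one probe at a time
            _probeInFlight = true;                               // half-open: allow a single probe
            return true;
        }
    }

    public void RecordSuccess()
    {
        lock (_gate)
        {
            _consecutiveFailures = 0;
            _openedAt = null;        // close the circuit
            _probeInFlight = false;
        }
    }

    public void RecordFailure(DateTimeOffset now)
    {
        lock (_gate)
        {
            _consecutiveFailures++;
            // A failed probe, or too many failures in a row, (re)opens the circuit.
            if (_probeInFlight || _consecutiveFailures >= _failureThreshold)
            {
                _openedAt = now;
                _probeInFlight = false;
            }
        }
    }
}

// One breaker per endpoint, so a dead leaderboard cannot block matchmaking.
static class Circuits
{
    private static readonly ConcurrentDictionary<string, CircuitBreaker> ByEndpoint = new();

    public static CircuitBreaker For(string endpoint) =>
        ByEndpoint.GetOrAdd(endpoint, _ => new CircuitBreaker(5, TimeSpan.FromSeconds(30)));
}
```

A caller would check Circuits.For("/leaderboard").TryAcquire(DateTimeOffset.UtcNow) before sending, fail fast when it returns false, and report the outcome with RecordSuccess or RecordFailure afterwards.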

Tell the Player What Is Happening

Long silent retries feel like a bug. If the first retry fails, update the UI: “Saving...” becomes “Connection issues, still trying.” If the budget runs out, surface a clear error with options: “Retry now,” “Save locally,” “Continue offline.” Never retry forever without feedback. Players will force-quit, which is worse than failing cleanly.
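One way to keep the UI honest is to derive its state from the retry loop's own counters. The sketch below is a hypothetical mapping: the SaveUiState and SaveStatus names are invented for illustration, and the final message is placeholder copy. It stays quiet through the first failure, switches to the "still trying" message once the first retry has also failed, and reports failure when the time budget is spent.

```csharp
using System;

enum SaveUiState { Saving, StillTrying, Failed }

static class SaveStatus
{
    // failedAttempts counts failures so far: 1 = the original request failed,
    // 2 = the first retry failed as well.
    public static SaveUiState StateFor(int failedAttempts, TimeSpan elapsed, TimeSpan budget)
    {
        if (elapsed >= budget) return SaveUiState.Failed;
        return failedAttempts >= 2 ? SaveUiState.StillTrying : SaveUiState.Saving;
    }

    public static string MessageFor(SaveUiState state) => state switch
    {
        SaveUiState.Saving => "Saving...",
        SaveUiState.StillTrying => "Connection issues, still trying.",
        _ => "Could not save. Retry now, save locally, or continue offline?"
    };
}
```

The retry loop would call StateFor after each failed attempt and push the resulting message to the UI, so the player never watches an unchanged spinner for the whole budget.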

“We shipped a cloud save retry loop with no budget. When our auth provider had a brief outage, the save system retried 30,000 times on some clients and filled their logs with 80 MB of noise. We added a 60-second budget and the bug reports stopped immediately.”

Related Issues

For client-side resilience patterns, see how to build an in-game debug console to inspect retry behavior live. For network testing, read how to test your game on slow network connections.

Add a retry budget and a player-facing message to your next network feature. Players forgive failure; they do not forgive a UI that spins forever.