Quick answer: Add a header to every save file containing schema version, write timestamp, byte count, and SHA-256 hash of the payload. Verify all four on load. Log mismatches to your crash reporter with enough context to diagnose. This catches 99% of cloud sync corruption before the player ever sees a broken save.

Cloud saves are magical when they work. A player starts a campaign on their desktop, continues on their Steam Deck during a flight, finishes on their laptop, and the progress follows them seamlessly. When they break, though, they break badly: a quest vanishes, an inventory empties, a hundred hours of progress disappears. The worst part is that cloud sync failures are often silent — the game loads the file, plays the corrupted state, and the player only notices when something does not match their memory. Here is how to catch it early.

How Cloud Saves Go Wrong

There are five common failure modes:

1. Partial upload. The cloud client (Steam, Epic, iCloud, Play Games) starts an upload, transfers most of the bytes, then loses connection. The remote file is left truncated. The next download on a different device gets the truncated version and loads it as if it were complete. If your save format tolerates truncation (e.g., has a tail section), the game runs with half the state missing.

2. Clock skew. Cloud sync uses "last write wins" based on file modification time. If one of the player's devices has a clock that is an hour behind, writes from the slower-clocked device are ignored because they look older than the existing file. Any progress made on that device is lost on next sync.

3. Conflict resolution race. The player saves on Device A while Device B is mid-sync. Device B uploads its older version last, overwriting Device A's newer save. Some platforms handle this correctly; others do not, and the one that fails is usually the one you did not test on.

4. Schema drift. A new game version writes saves in an updated format. The player opens the game on an unpatched device, which loads the new-format save with the old schema reader and either crashes or silently resets fields it does not recognize.

5. Concurrent writes. The game is actively writing a save when the cloud sync decides to upload it. The uploaded file is the in-progress half-written state. Most platforms have a "don't sync while the game is running" policy, but the boundary is racy, especially during manual saves.
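The clock-skew failure in particular is easy to reproduce on paper. Last-write-wins resolution, as cloud clients typically implement it, can be sketched as follows (an illustrative sketch, not any platform's actual implementation):

```go
package main

// resolveLastWriteWins picks whichever save claims the newer
// modification time. A lagging device clock makes genuinely newer
// progress look older, so that progress silently loses the comparison.
func resolveLastWriteWins(localMtimeMs, remoteMtimeMs int64) string {
	if localMtimeMs >= remoteMtimeMs {
		return "local"
	}
	return "remote"
}
```

If the local device wrote its save later in real time but its clock runs an hour slow, its stamped mtime is an hour earlier than the remote file's, and the newer progress is discarded.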

Design the Save Header for Verification

Every save file should start with a header that makes integrity verification possible. A minimal header:

type SaveHeader struct {
    Magic         [4]byte  // "GSAV"
    Version       uint32   // schema version
    BuildNumber   uint32   // game build that wrote this save
    WrittenAtMs   uint64   // UTC milliseconds when written
    PayloadBytes  uint64   // size of body after header
    PayloadSHA256 [32]byte // SHA-256 of the body
    HeaderSHA256  [32]byte // SHA-256 of all fields above
}

On load, check three things: does the body size match? Does the body hash match? Does the header hash match? Any failure is a corruption signal. Abort the load, preserve the file for diagnostic upload, and fall back to the previous good save.

Always Keep a Previous Good Save

Never overwrite a save file directly. Write the new save to a temporary file, verify it loaded back correctly, then atomically rename it over the old file. Keep the old file for one rotation as a fallback:

func (g *Game) WriteSave(data SaveData) error {
    tmp := "save.tmp"
    if err := writeWithHeader(tmp, data); err != nil {
        return err
    }

    // Round-trip verification: load what we just wrote
    if _, err := loadAndVerify(tmp); err != nil {
        os.Remove(tmp)
        return fmt.Errorf("round-trip verify failed: %w", err)
    }

    // Rotate: save.current -> save.backup, tmp -> save.current.
    // The first rename's error is deliberately ignored: save.current
    // does not exist yet on the very first save.
    _ = os.Rename("save.current", "save.backup")
    return os.Rename(tmp, "save.current")
}

On load, try save.current first. If it fails verification, fall back to save.backup and log a corruption event. The player loses at most one session of progress instead of everything.

Report Corruption to Your Crash Tool

When verification fails, do not just silently fall back. Emit a structured event that includes everything you need to investigate:

report := CorruptionReport{
    PlayerID:       player.ID,
    Platform:       runtime.GOOS,
    FileSize:       stat.Size(),
    ExpectedSize:   header.PayloadBytes,
    SchemaVersion:  header.Version,
    BuildNumber:    header.BuildNumber,
    WrittenAtMs:    header.WrittenAtMs,
    CurrentClockMs: NowUnixMs(),
    ClockSkewMs:    NowUnixMs() - header.WrittenAtMs,
    HashMatch:      header.PayloadSHA256 == computedPayloadHash,
    HeaderMatch:    header.HeaderSHA256 == computedHeaderHash,
}
bugnet.Capture(report)

Over time this gives you an aggregated picture of where corruption happens: which platform, which save slot, which game version, whether the clock was skewed. The patterns show up in the aggregate even if no single player could tell you what happened.
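That aggregation can be as simple as grouping events by platform and build (a sketch; `CorruptionEvent` here is a trimmed-down illustrative type, not the full report above):

```go
package main

import "fmt"

// CorruptionEvent carries only the fields needed for grouping.
type CorruptionEvent struct {
	Platform string
	Build    uint32
}

// tallyCorruption groups events by (platform, build) so that a spike
// tied to one platform or one patch stands out in the aggregate.
func tallyCorruption(events []CorruptionEvent) map[string]int {
	counts := make(map[string]int)
	for _, e := range events {
		key := fmt.Sprintf("%s/build-%d", e.Platform, e.Build)
		counts[key]++
	}
	return counts
}
```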

Detect Clock Skew Proactively

If your save header says the file was written 3 hours in the future, something is wrong. Clamp writes against a sanity check:

func SanityCheckClock(h SaveHeader) {
    now := NowUnixMs()
    // Compute the skew as a signed value: timestamps are unsigned,
    // and a future-dated save must produce a negative number.
    skew := int64(now) - int64(h.WrittenAtMs)

    if skew < -5*60*1000 {
        // Save claims to be written more than 5 min in the future
        slog.Warn("save timestamp is in the future",
            "skew_ms", skew)
        bugnet.Capture("save_clock_future", skew)
    }
    if skew > 365*24*3600*1000 {
        // Save is more than a year old
        slog.Warn("save timestamp is suspiciously old",
            "skew_ms", skew)
    }
}

A future-dated save is a strong signal that the clock on the originating device was set wrong, which almost always leads to a sync loss later. Some games display a warning to the player ("your device clock is off — fix it or you may lose save progress") which is an effective way to catch the issue before it causes damage.

Handle Conflicts With a Resolution UI

When a player has two saves from different devices that are genuinely both valid (one offline on a plane, one at home), "last write wins" will lose one. Instead, detect the conflict and ask the player:

if IsCloudConflict(local, remote) {
    ShowDialog(ConflictDialog{
        LocalTime:      local.WrittenAt,
        LocalPlaytime:  local.PlaytimeHours,
        RemoteTime:     remote.WrittenAt,
        RemotePlaytime: remote.PlaytimeHours,
        OnKeepLocal:    func() { UploadLocalOverwrite() },
        OnKeepRemote:   func() { DownloadRemoteOverwrite() },
        OnKeepBoth:     func() { KeepBothAsSeparateSlots() },
    })
}

This is more work than silent resolution but it is the only way to not lose genuine progress. "Keep both" as a third option is worth the effort because it defers the decision to a moment when the player can actually look at both saves and decide.

Run a Corruption Canary

Run a background job in your live ops that samples save files and checks them for integrity. Not every save — a random 1% sample is enough. Alert when the corruption rate exceeds a threshold:

sample := saves.SampleRecent(1000)  // random sample of recent saves
corrupt := 0
for _, s := range sample {
    if _, err := Verify(s); err != nil {
        corrupt++
    }
}
rate := float64(corrupt) / float64(len(sample))

if rate > 0.001 {  // more than 0.1%
    alert.Fire("save_corruption_rate_high", rate)
}

This catches new corruption sources before they affect a critical mass of players. A spike in corruption after a patch is almost always a save format bug that needs a hotfix.

"Save corruption is the bug class that players never forgive. Lose their progress once and they will stop playing. Detecting corruption before it causes data loss is worth any amount of engineering time."

Related Issues

For migrating between save format versions see how to debug save file migration bugs. For debugging save corruption patterns that trace to game logic, read how to debug game save corruption bugs.

Header, hash, backup, verify. Four steps to save integrity you will not regret.