Quick answer: Write to a temp file, fsync, then rename over the live save. Add a checksum header to detect partial writes. For console-grade resilience, alternate between two save slots with a tiny pointer file.

A player saves the game. Their laptop battery dies five seconds later. They charge up, relaunch, and the “Continue” option shows the save but loading it crashes the game — or worse, succeeds with corrupted state. They lose 8 hours of progress because of an unfortunate race between “writing” and “done”.

Why Naive Saves Corrupt

The typical save code:

def save(state):
    with open("save.dat", "wb") as f:
        f.write(serialize(state))

The OS doesn’t flush to disk immediately on close in many cases. Data sits in OS write cache for milliseconds to seconds. If power dies before the cache is committed, the file on disk is truncated, half-written, or unmodified depending on timing. The same applies to fwrite on platforms with buffered I/O.

Fix 1: Atomic Write-Rename

import os, struct, hashlib

def save_atomic(path, data):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # force commit to disk
    os.replace(tmp, path)        # atomic rename

Steps:

  1. Write all data to a temp file.
  2. fsync the file so its contents are on disk, not just in cache.
  3. Rename the temp file over the real save. The OS guarantees this rename is atomic.

If power dies mid-write to the temp file, the real save is untouched. If power dies between fsync and rename, the real save is untouched. Only after the rename completes is the new save visible to readers.

Fix 2: Checksum Header

Even with atomic writes, application bugs can produce malformed save data. Detect:

def pack(state) -> bytes:
    body = serialize(state)
    checksum = hashlib.sha256(body).digest()[:8]
    version = struct.pack("<I", SAVE_VERSION)
    return version + checksum + body

def unpack(blob):
    version = struct.unpack("<I", blob[:4])[0]
    checksum = blob[4:12]
    body = blob[12:]
    if hashlib.sha256(body).digest()[:8] != checksum:
        raise CorruptSave()
    return deserialize(body, version)

On load, verify the checksum before deserializing. A corrupted file is detected explicitly rather than crashing the parser deep in nested struct reads.

Fix 3: Double-Buffered Slots

Maintain two save slots and a tiny pointer file:

def save_double_buffered(state):
    cur = read_pointer()      # 0 or 1
    nxt = 1 - cur
    save_atomic(f"save_{nxt}.dat", pack(state))
    save_atomic("save_pointer", struct.pack("<I", nxt))

Loading reads the pointer, tries that slot, falls back to the other if it’s corrupt. Worst case the player loses one save’s worth of progress (a few minutes) but never the entire campaign.

Fix 4: Don’t Save Mid-Frame

If the player presses Save and you read their entity state while another thread is mid-write to that state, you serialize garbage. Snapshot in a single tick before serializing:

def save_snapshot(world):
    snapshot = world.copy_state()   # synchronously, on game thread
    threading.Thread(target=lambda: save_double_buffered(snapshot)).start()

The expensive disk write runs on a background thread; the snapshot is taken atomically on the game thread, eliminating concurrent-mutation corruption.

Verifying

Simulate power loss: kill the game process at a random offset during a save with kill -9. Repeat many times across a save loop. Load after each kill — the load should always succeed, either with the new save or the previous one. If you ever see a load fail or load corrupt data, your atomicity has a hole.

“Saves should survive every conceivable interruption. Atomic write, checksum, double-buffer — layers of protection are cheap; lost progress is not.”

Include a save corruption automated test in your CI — spawn the game, save, kill at random instants, verify load works. Five minutes of CI catches a year of player complaints.