Quick answer: Compute per-tick state checksums on every client and compare them every tick. When checksums diverge, use per-subsystem hash breakdowns to isolate which system drifted first. The root cause is almost always floating-point non-determinism, uninitialized memory, or iteration-order differences in hash maps. Fix these with fixed-point arithmetic, explicit initialization, and ordered collections.
Deterministic lockstep is the architecture that made RTS multiplayer possible. Instead of synchronizing the full game state every frame, you synchronize only player inputs and rely on every client simulating the same result from the same inputs. It is elegant, bandwidth-efficient, and utterly unforgiving. A single bit of divergence on tick 200 becomes a completely different game state by tick 2000. One player sees their army winning; the other sees it losing. This guide covers how to find where the divergence starts, why it happens, and how to prevent it.
Per-Tick State Checksums: Your First Line of Defense
The first rule of debugging desync is detecting it as early as possible. If you wait until a player reports that “the game did something weird,” you are thousands of ticks past the point of divergence and the trail is cold. Instead, compute a checksum of your entire game state on every simulation tick and exchange it between clients.
The checksum does not need to be cryptographically secure. A CRC32 or FNV-1a hash is fast enough to run every tick without measurable performance impact. Avoid a plain XOR of the fields: XOR ignores ordering, so two units that swap values produce the same result. The important thing is that the checksum covers all mutable simulation state: unit positions, health values, resource counts, cooldown timers, random number generator state, and anything else that affects gameplay outcomes.
# Compute a per-tick checksum of the full game state.
# state.units must be visited in the same order on every client
# (for example, sorted by unit ID), or the hashes will not match.
func compute_state_checksum(state: GameState) -> int:
    var hash := 0x811c9dc5 # FNV-1a 32-bit offset basis
    for unit in state.units:
        hash = fnv1a_combine(hash, unit.position.x_fixed)
        hash = fnv1a_combine(hash, unit.position.y_fixed)
        hash = fnv1a_combine(hash, unit.health)
        hash = fnv1a_combine(hash, unit.state_id)
    hash = fnv1a_combine(hash, state.rng_state)
    hash = fnv1a_combine(hash, state.tick_number)
    return hash
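The listing above calls fnv1a_combine without defining it. One possible shape for that helper, sketched in Rust, is shown here. The function names and the choice to feed each value in as four little-endian bytes are assumptions for illustration, not something the original code specifies.

```rust
/// FNV-1a (32-bit) constants.
const FNV_OFFSET_BASIS: u32 = 0x811c_9dc5;
const FNV_PRIME: u32 = 0x0100_0193;

/// Folds a byte slice into a running FNV-1a hash:
/// XOR each byte in, then multiply by the prime (wrapping at 32 bits).
fn fnv1a_bytes(hash: u32, bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(hash, |h, &byte| (h ^ u32::from(byte)).wrapping_mul(FNV_PRIME))
}

/// Folds one 32-bit value into the running hash.
/// The byte order is fixed to little-endian explicitly so that every
/// platform feeds the same byte sequence into the hash.
fn fnv1a_combine(hash: u32, value: u32) -> u32 {
    fnv1a_bytes(hash, &value.to_le_bytes())
}
```

Because each step depends on the running hash, the result is sensitive to the order in which values are combined, which is why the iteration order over units has to match on every client.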
When two clients report different checksums for the same tick, you know the exact moment desync began. But knowing which tick is only the beginning — you still need to know which part of the state diverged. That is where subsystem checksums come in.
Subsystem Hash Breakdown: Narrowing the Search
A single global checksum tells you that something is wrong. Subsystem checksums tell you where. Instead of hashing the entire state into one value, compute separate hashes for each major subsystem: physics, unit state, economy, pathfinding, combat resolution, and the random number generator. When the global checksum diverges, compare the subsystem checksums to identify which system drifted first.
This is the difference between “desync happened on tick 4,712” and “the physics subsystem diverged on tick 4,712 while all other systems were still in agreement.” The second statement immediately tells you to look at collision resolution, force accumulation, or position integration — not at the economy or the AI.
In practice, desync rarely starts in isolation. A physics divergence on tick 4,712 will cascade into a combat divergence on tick 4,720 when a unit that should have been in range on one client is out of range on the other. Subsystem checksums let you trace the cascade backward to the origin. Always investigate the first subsystem that diverges, not the most obviously broken one.
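As a rough sketch of that backward trace, the following Rust compares two clients' per-tick subsystem hashes and reports the first tick where they disagree. The struct layout, field names, and function names are hypothetical, and both histories are assumed to be aligned so that index i refers to the same tick on both clients.

```rust
/// One tick's worth of per-subsystem hashes (hypothetical layout).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SubsystemHashes {
    tick: u32,
    physics: u32,
    units: u32,
    economy: u32,
    pathfinding: u32,
    combat: u32,
    rng: u32,
}

impl SubsystemHashes {
    /// Names of the subsystems whose hashes differ from `other`.
    fn diverged_from(&self, other: &Self) -> Vec<&'static str> {
        let pairs = [
            ("physics", self.physics, other.physics),
            ("units", self.units, other.units),
            ("economy", self.economy, other.economy),
            ("pathfinding", self.pathfinding, other.pathfinding),
            ("combat", self.combat, other.combat),
            ("rng", self.rng, other.rng),
        ];
        pairs
            .iter()
            .filter(|(_, ours, theirs)| ours != theirs)
            .map(|(name, _, _)| *name)
            .collect()
    }
}

/// Walks both histories in tick order and returns the first tick at which
/// any subsystem disagrees, together with the subsystems involved.
fn first_divergence(
    local: &[SubsystemHashes],
    remote: &[SubsystemHashes],
) -> Option<(u32, Vec<&'static str>)> {
    local.iter().zip(remote.iter()).find_map(|(l, r)| {
        let diverged = l.diverged_from(r);
        if diverged.is_empty() {
            None
        } else {
            Some((l.tick, diverged))
        }
    })
}
```

Returning only the earliest mismatch is deliberate: later ticks usually show several subsystems out of sync, and those are consequences rather than causes.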
Input Hash Logging: Ruling Out the Network Layer
Before you blame the simulation, make sure the inputs are actually identical. A common source of desync is not simulation non-determinism but input delivery errors: a dropped packet, a reordered message, an input that arrived on one client but not another. Hash the input buffer for each tick alongside the state checksum. If input hashes match but state hashes diverge, the problem is in the simulation. If input hashes diverge, the problem is in the network layer.
Log every input with its tick number, player ID, and a hash of the input payload. On desync detection, dump the input logs from both clients and diff them. You are looking for missing inputs, duplicate inputs, or inputs assigned to the wrong tick. These bugs are network bugs, not simulation bugs, and they require a completely different fix — typically improving your input confirmation protocol or adding redundant input transmission.
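A minimal sketch of that log diff in Rust follows. The record layout and function names are hypothetical. It counts how often each (tick, player, payload hash) combination appears in each log and reports every combination whose counts differ, which surfaces missing and duplicate inputs directly. An input assigned to the wrong tick shows up as one entry missing on one side and an extra entry at a different tick.

```rust
use std::collections::BTreeMap;

/// Hypothetical log record: one entry per input a client applied.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct InputRecord {
    tick: u32,
    player_id: u8,
    payload_hash: u32,
}

/// (tick, player_id, payload_hash)
type InputKey = (u32, u8, u32);

/// Counts how many times each distinct input appears in a log.
fn count_inputs(log: &[InputRecord]) -> BTreeMap<InputKey, usize> {
    let mut counts = BTreeMap::new();
    for record in log {
        let key = (record.tick, record.player_id, record.payload_hash);
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// Returns every input whose occurrence count differs between the two logs,
/// as (key, local_count, remote_count), sorted by tick.
fn diff_input_logs(
    local: &[InputRecord],
    remote: &[InputRecord],
) -> Vec<(InputKey, usize, usize)> {
    let local_counts = count_inputs(local);
    let remote_counts = count_inputs(remote);

    // Union of all keys seen on either side, in sorted order.
    let mut keys: Vec<InputKey> = local_counts
        .keys()
        .chain(remote_counts.keys())
        .copied()
        .collect();
    keys.sort();
    keys.dedup();

    keys.into_iter()
        .filter_map(|key| {
            let l = local_counts.get(&key).copied().unwrap_or(0);
            let r = remote_counts.get(&key).copied().unwrap_or(0);
            if l != r {
                Some((key, l, r))
            } else {
                None
            }
        })
        .collect()
}
```

An empty result means both clients applied the same inputs on the same ticks, so the investigation can move on to the simulation itself.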
Floating-Point Canonicalization: The Hardest Problem
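If your inputs match and your state still diverges, the most likely culprit is floating-point arithmetic. IEEE 754 requires correctly rounded results for the basic operations (add, subtract, multiply, divide, square root), so a single operation on the same operands, in the same format and rounding mode, gives the same bits on any conforming CPU. What the standard does not pin down is how a compiler turns your source expression into those operations. Different compilers, optimization levels, and target CPUs may reorder operations, keep intermediates in wider registers (as x87 code does), or contract a multiply and an add into one fused multiply-add. A fused multiply-add rounds once instead of twice, so it can produce a different result than a separate multiply and add. Fast-math permits these substitutions, and some compilers contract by default even without it. Library functions such as sin, cos, and pow are not required to be correctly rounded and can differ between math libraries.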
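To make the fused multiply-add difference concrete, here is a small Rust illustration. The function name and the chosen values are made up for the example. It computes x * x - c twice, once with a separate multiply and subtract and once with the standard library's f64::mul_add, which rounds only once.

```rust
/// Returns (separate, fused) results of computing x * x - c in f64.
fn fma_vs_separate() -> (f64, f64) {
    let eps = 1.0 / (1u64 << 30) as f64; // exactly 2^-30
    let x = 1.0 + eps;
    let c = 1.0 + 2.0 * eps; // exactly 1 + 2^-29

    // Exact product: 1 + 2^-29 + 2^-60. The 2^-60 term does not fit in an
    // f64 near 1.0, so the multiply rounds to 1 + 2^-29 and the result is 0.
    let separate = x * x - c;

    // mul_add keeps the exact product until the single final rounding,
    // so the 2^-60 term survives.
    let fused = x.mul_add(x, -c);

    (separate, fused)
}
```

The two results are 0 and 2^-60. The gap is tiny, but once a value like this feeds a comparison or a rounding step, the two clients take different branches and the difference stops being tiny.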
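// Fixed-point type to replace floats in simulation code
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Fixed64 {
    raw: i64, // 32.32 fixed-point representation
}

impl Fixed64 {
    const FRAC_BITS: u32 = 32;

    fn from_int(v: i32) -> Self {
        Self { raw: (v as i64) << Self::FRAC_BITS }
    }

    fn mul(self, other: Self) -> Self {
        // Widen to 128 bits so the intermediate product cannot overflow,
        // then shift back to 32.32
        let wide = (self.raw as i128) * (other.raw as i128);
        Self { raw: (wide >> Self::FRAC_BITS) as i64 }
    }

    fn add(self, other: Self) -> Self {
        // wrapping_add behaves identically in debug and release builds;
        // plain `+` panics on overflow in debug but wraps in release
        Self { raw: self.raw.wrapping_add(other.raw) }
    }
}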
The most robust solution is to eliminate floats from your simulation entirely and use fixed-point arithmetic. Fixed-point operations are integer operations, which are deterministic on every platform and every compiler. The cost is reduced range and precision compared to floats, but for game simulation — positions measured in tiles, health measured in integers, timers measured in ticks — fixed-point is more than sufficient.
If you cannot eliminate floats (because you are working with an engine that uses them pervasively), you need to canonicalize them. Disable fast-math and fused multiply-add contraction in your compiler flags (-fno-fast-math and -ffp-contract=off in GCC/Clang, /fp:strict in MSVC). Force all floating-point operations to single precision by avoiding implicit promotions to double. Flush denormals to zero on all platforms at startup. And test across every target platform relentlessly, because a build that is deterministic on x86 may not be deterministic on ARM.
The Replay Divergence Tool: Your Most Powerful Weapon
When checksums tell you that tick 4,712 diverged in the physics subsystem, you need to find the exact variable that differed. This is where a replay divergence tool becomes invaluable. The tool works by recording the complete initial state and all inputs from a real game session, then replaying the simulation on two different machines (or the same machine with two different builds) and comparing the full state dump at every tick.
The state dump should be a structured text format — one line per variable, with the subsystem, entity ID, field name, and value. When you diff the dumps from two replays, the first line that differs is the origin of the desync. From there, you read the code path that writes that variable and look for non-determinism: a float operation, an uninitialized field, a hash map iteration, or a sort that does not have a stable tiebreaker.
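The diff step itself is small. The Rust sketch below assumes a space-separated line layout of subsystem, entity ID, field name, and value. That layout and the function names are illustrative choices, not a format the article prescribes.

```rust
/// Formats one state variable as a single dump line.
fn dump_line(subsystem: &str, entity_id: u32, field: &str, value: i64) -> String {
    format!("{} {} {} {}", subsystem, entity_id, field, value)
}

/// Returns the 1-based line number and both versions of the first line
/// that differs between two dumps, or None if the dumps are identical.
/// A dump that ends early is reported with "<missing>" on its side.
fn first_differing_line<'a>(
    dump_a: &'a str,
    dump_b: &'a str,
) -> Option<(usize, &'a str, &'a str)> {
    let mut lines_a = dump_a.lines();
    let mut lines_b = dump_b.lines();
    let mut line_no = 1;
    loop {
        match (lines_a.next(), lines_b.next()) {
            (None, None) => return None,
            (a, b) if a == b => line_no += 1,
            (a, b) => {
                return Some((
                    line_no,
                    a.unwrap_or("<missing>"),
                    b.unwrap_or("<missing>"),
                ))
            }
        }
    }
}
```

This only works if both dumps list variables in the same order, so the dump writer should walk subsystems and entities in a fixed, sorted order. Writing raw integer values (or the bit pattern of a float) rather than a rounded decimal avoids hiding a one-bit difference.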
Building this tool is not optional. It is the single most important debugging infrastructure for a lockstep game. Without it, you are guessing. With it, every desync is a solvable puzzle with a clear starting point. The tool pays for itself the first time it identifies a desync caused by HashMap iteration order — a bug that would have taken weeks to find by any other method.
Common Desync Sources and How to Eliminate Them
Beyond floating-point, several other patterns reliably cause desync in lockstep games. Uninitialized memory is the second most common: a struct field that defaults to zero on one platform and garbage on another. Always explicitly initialize every field. Hash map and set iteration order varies between platforms and even between runs — replace them with sorted arrays or ordered maps for any data structure that feeds into the simulation. System time, wall-clock timestamps, and locale-dependent string operations must never appear in simulation code. The random number generator must be seeded identically and advanced in the same order on all clients — use a single deterministic RNG for the simulation and a separate one for cosmetic effects.
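Two of these fixes fit in a short Rust sketch: an ordered map in place of a hash map, and a dedicated integer-only generator for the simulation. The generator here is SplitMix64, picked only because it is compact. SimRng, apply_damage_rolls, and the unit representation (ID mapped to health) are hypothetical names for this example.

```rust
use std::collections::BTreeMap;

/// Simulation-only random number generator (SplitMix64).
/// It uses only wrapping integer arithmetic, so the same seed yields the
/// same sequence on every platform. Cosmetic effects should use a
/// separate generator so they never advance this state.
struct SimRng {
    state: u64,
}

impl SimRng {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Applies one random damage roll (0 to 9) per unit.
/// BTreeMap iterates in ascending key order, so each unit ID receives the
/// same roll on every client regardless of insertion order. With a
/// HashMap, the visit order, and so the pairing of rolls to units,
/// could differ between runs.
fn apply_damage_rolls(units: &mut BTreeMap<u32, i32>, rng: &mut SimRng) {
    for health in units.values_mut() {
        let roll = (rng.next_u64() % 10) as i32;
        *health -= roll;
    }
}
```

The point of the example is the interaction: a shared RNG is only deterministic if every client draws from it in the same order, and iteration order decides the draw order.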
Pointer-based sorting is another subtle trap. If you sort entities by memory address (even accidentally, through pointer comparison in a comparator), the order will differ between machines. Sort by a deterministic key like entity ID instead. Similarly, thread scheduling is non-deterministic — all simulation logic must run on a single thread or use deterministic synchronization barriers.
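A deterministic comparator can be sketched in a few lines of Rust. The Entity struct and its priority field are assumptions for illustration. The sort key ends with the entity ID, so no two entities ever compare as equal and the result does not depend on the input order or on where the entities live in memory.

```rust
/// Hypothetical entity with a gameplay priority and a unique ID.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Entity {
    id: u32,
    priority: i32,
}

/// Orders entities by priority (highest first), breaking ties by entity ID.
/// Because IDs are unique, the ordering is total, so even an unstable
/// sort produces the same result on every machine.
fn sort_for_update(entities: &mut [Entity]) {
    entities.sort_unstable_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then_with(|| a.id.cmp(&b.id))
    });
}
```

Without the ID tiebreaker, two entities with equal priority could end up in either order depending on how they were stored, which is exactly the kind of difference that surfaces as a desync many ticks later.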
“Deterministic lockstep is not about writing deterministic code. It is about systematically eliminating every source of non-determinism from a codebase that is non-deterministic by default.”
Finally, test determinism continuously. Add a CI job that replays recorded sessions on multiple platforms and compares checksums. Every commit should prove that the simulation remains deterministic. A desync bug that slips into the codebase undetected for two weeks is far harder to find than one caught on the day it was introduced, because every commit made in between becomes a suspect.
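The core of such a CI check might look like the Rust sketch below. Everything in it is a stand-in: the Replay struct, the toy step function, and the use of the state value as its own checksum are assumptions chosen to keep the example self-contained. A real job would load a recorded session and call the game's actual simulation step and checksum routine.

```rust
/// Hypothetical recorded session: the initial state, one input per tick,
/// and the checksum the original run produced after each tick.
struct Replay {
    initial_state: u64,
    inputs: Vec<u32>,
    expected_checksums: Vec<u64>,
}

/// Stand-in for the real simulation step. Integer-only, so deterministic.
fn step(state: u64, input: u32) -> u64 {
    state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(u64::from(input))
        .wrapping_add(1)
}

/// Replays the session and returns Err(tick_index) for the first tick
/// whose checksum disagrees with the recording.
fn verify_replay(replay: &Replay) -> Result<(), usize> {
    let mut state = replay.initial_state;
    for (tick, (&input, &expected)) in replay
        .inputs
        .iter()
        .zip(&replay.expected_checksums)
        .enumerate()
    {
        state = step(state, input);
        // In this toy example the state doubles as its own checksum.
        if state != expected {
            return Err(tick);
        }
    }
    Ok(())
}
```

Running the same recorded sessions on each target platform and failing the build on any Err gives the failing commit and the failing tick in one step.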
Related Issues
For a deeper dive into floating-point issues across CPUs, see our guide on debugging floating-point determinism across platforms. For setting up the CI infrastructure to catch these regressions automatically, read how to build automated smoke tests for game builds.
If two clients can run the same inputs and get different results, you do not have a networking problem. You have a determinism problem. Fix it at the source.