Quick answer: Intermittent crashes seem random but actually depend on conditions you haven't identified yet, most commonly a race condition (timing-dependent), memory corruption (the crash surfaces far from its cause), or a rare combination of state. Because any single occurrence is hard to catch, you fix them by capturing many occurrences from the field and finding what they have in common: the same hardware, action, or state. That shared pattern is the hidden condition to address.

Intermittent crashes, the ones that happen occasionally and unpredictably and refuse to reproduce on demand, are among the hardest bugs to fix. They feel random, which makes them feel hopeless. But 'intermittent' almost never means truly random; it means the crash depends on a condition you haven't found yet. The path to fixing them is to find that condition, and since you can't reliably trigger the crash, you find it in aggregate.

What Makes a Crash Intermittent

An intermittent crash depends on conditions that aren't always met, so it only happens sometimes. There are three usual causes. Race conditions: the crash depends on the timing of concurrent operations, so it manifests only in the rare bad ordering. Observing the program (attaching a debugger, adding logging) often shifts the timing and masks the bug, which is why these are called heisenbugs. Memory corruption: something corrupts memory, and the crash happens later in unrelated code when that memory is used, so the crash location and timing seem random. Rare state combinations: the crash needs a specific, uncommon combination of conditions that occurs only occasionally.

In all cases, the crash is deterministic given its hidden condition; it only appears random because the condition is unidentified and uncommon. So the problem isn't really 'a crash that happens randomly'; it's 'a crash whose triggering condition I haven't found.' That reframing points the way: find the condition.

How to Find the Hidden Condition

You can't reliably catch an intermittent crash one instance at a time, so catch it in aggregate. Capture every occurrence from the field with full context (stack trace, device, state, recent logs) and study the pattern across many of them. Any single occurrence is a mystery, but dozens of occurrences may reveal what they share: the same hardware, the same preceding action, the same memory pressure, or the same subsystem. Whatever they share is the candidate for the hidden condition.
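The aggregate step described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: it assumes each captured occurrence has already been flattened into a dictionary of context fields, and the field names (`gpu`, `last_action`, `os`) and the `shared_conditions` helper are hypothetical.

```python
from collections import Counter


def shared_conditions(reports, min_share=0.9):
    """Return (field, value, share) for every context value that appears
    in at least `min_share` of the crash reports.

    `reports` is a list of dicts with hashable values (an assumed format).
    """
    if not reports:
        return []

    counts = Counter()
    for report in reports:
        for field, value in report.items():
            counts[(field, value)] += 1

    total = len(reports)
    shared = [
        (field, value, n / total)
        for (field, value), n in counts.items()
        if n / total >= min_share
    ]
    # Most widely shared values first; ties broken by field name.
    return sorted(shared, key=lambda item: (-item[2], item[0], str(item[1])))


# Hypothetical occurrences of one crash signature.
reports = [
    {"gpu": "vendor_a", "last_action": "load_level", "os": "win11"},
    {"gpu": "vendor_a", "last_action": "open_menu", "os": "win10"},
    {"gpu": "vendor_a", "last_action": "load_level", "os": "linux"},
    {"gpu": "vendor_a", "last_action": "save_game", "os": "win11"},
]

for field, value, share in shared_conditions(reports):
    print(f"{field}={value} appears in {share:.0%} of occurrences")
```

A value shared by every occurrence is only a lead if it is not also shared by nearly every user, so it is worth comparing the result against how common that value is across all sessions before treating it as the condition.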

Bugnet captures this context automatically and groups occurrences by signature, which is exactly what makes intermittent crashes tractable: the grouped collection of occurrences exposes the correlation that no single instance shows. If every occurrence of an intermittent crash involves the same threaded subsystem, or the same hardware, or the same rare state, you've found the condition. The aggregate turns 'random' into 'happens under these specific circumstances,' which is the diagnosis.

How to Fix It

Once you know the condition, fix accordingly. For a race condition, add proper synchronization or ordering so the timing-dependent bad case can't occur; make the concurrency correct rather than relying on lucky timing. For memory corruption, find the corrupting code and fix the out-of-bounds write, use-after-free, or buffer overrun. Sanitizers like AddressSanitizer are invaluable here because they report the invalid access at its source rather than where it later crashes. For a rare state combination, handle the edge case that the trace and condition reveal.
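As an illustration of the race-condition case, here is a minimal sketch of a check-then-act race closed with a lock. The `JobQueue` class and `drain` helper are hypothetical names invented for this example. Without the lock, two threads can both see a non-empty list and then one of them pops from an empty list, which raises `IndexError` only in the rare bad interleaving.

```python
import threading


class JobQueue:
    """A tiny shared work list (hypothetical example class)."""

    def __init__(self, jobs):
        self._jobs = list(jobs)
        self._lock = threading.Lock()

    def take(self):
        """Return the next job, or None when the queue is empty."""
        # The emptiness check and the pop happen under one lock, so no
        # other thread can empty the list between the two steps.
        with self._lock:
            if self._jobs:
                return self._jobs.pop()
            return None


def drain(queue, workers=4):
    """Drain the queue from several threads and return the jobs taken."""
    taken = []
    taken_lock = threading.Lock()

    def worker():
        while True:
            job = queue.take()
            if job is None:
                return
            with taken_lock:
                taken.append(job)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return taken


taken = drain(JobQueue(range(1000)))
print(len(taken))
```

The point of the design is that correctness no longer depends on which thread runs when: every job is taken exactly once under any interleaving, instead of usually.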

Verify without relying on reproduction. Since you can't reliably trigger the crash, confirm the fix by watching the field: with version-tagged reporting, ship the fix and check whether the intermittent crash's occurrences drop to zero on the new version. If a crash that was steadily (if occasionally) accumulating stops on the fixed build once that build has seen comparable usage, it's resolved. This field-driven loop is how you fix and verify the crashes you can't catch in the act.
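The version-tagged check above might look like the following sketch. It assumes you can export, for one crash signature, the build version of each occurrence plus a session count per version; the function names, the version strings, and the `min_sessions` threshold are illustrative assumptions, not part of any specific reporting tool.

```python
from collections import Counter


def crash_rate_by_version(crash_versions, sessions_by_version):
    """Crashes per session for each version that has recorded sessions."""
    crashes = Counter(crash_versions)
    return {
        version: crashes[version] / sessions
        for version, sessions in sessions_by_version.items()
        if sessions > 0
    }


def looks_resolved(crash_versions, sessions_by_version, fixed_version,
                   min_sessions=1000):
    """Return True/False once the fixed build has enough sessions,
    or None while there is too little data to judge."""
    sessions = sessions_by_version.get(fixed_version, 0)
    if sessions < min_sessions:
        return None
    return Counter(crash_versions)[fixed_version] == 0


# Hypothetical field data: one version tag per crash occurrence.
crash_versions = ["1.4.0"] * 12 + ["1.4.1"] * 9
sessions_by_version = {"1.4.0": 4000, "1.4.1": 3000, "1.5.0": 3500}

rates = crash_rate_by_version(crash_versions, sessions_by_version)
resolved = looks_resolved(crash_versions, sessions_by_version, "1.5.0")
print(rates, resolved)
```

The session threshold matters because a rare crash can show zero occurrences on a new build purely by chance while that build has little usage, so waiting for comparable exposure avoids declaring victory early.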

Intermittent crashes aren't random; they depend on a condition you haven't found. You can't catch them one at a time, so capture many and let the pattern reveal the hidden cause.