Why does a single matchmaker crash affect so many players?

Because a matchmaker coordinates many clients at once and holds every queue in memory. When it crashes, that state is lost, so every player mid-formation is orphaned into a queue that will never resolve. The blast radius is large by design, which is why capturing the queue state at the crash matters so much.

How do I tell a relay crash from a matchmaking crash when both look like a frozen loading screen?

Attach service-specific context. A matchmaker report carries queue sizes, party composition, and the formation stage that was executing; a relay report carries the session id, active connection count, and node. The stranded player maps to whichever service holds the matching context, so you localize the fault instead of guessing.

Should I report stuck queues that are not actual crashes?

Yes. A matchmaker can be healthy and still strand players because the population is thin or a backfill is starving. Emit a report when a formation attempt exceeds a time budget or abandons a player, so non-crash failures appear beside real crashes and you can tell a population problem from a code regression.

Crash Reporting for Relay and Matchmaking Services

Quick answer: Relay and matchmaking services sit on the critical path to every multiplayer session, so when they crash they do not fail quietly: they strand players in stuck queues or drop live connections. The fix is to capture crashes with the queue and match-formation state attached (party composition, queue wait time, the relay session id, and which stage of match formation failed), so that a stranded player maps to a specific, reproducible fault.

Relay servers and matchmakers are the unglamorous plumbing of multiplayer games, and when they break, players feel it immediately: a queue that never resolves, a match that forms and then instantly dies, a session that drops everyone halfway through. Because these services coordinate many clients at once, a single crash can affect dozens of players, and the symptom they report ("stuck in queue") rarely points at the real fault. This post covers how to capture crashes in relay and matchmaking services, what queue and match-formation context to attach, and how to tell a relay failure apart from a matchmaker failure when both look like a frozen loading screen.

Failures here strand players; they do not just log

A crash in your game client annoys one player. A crash in your matchmaker can leave an entire pool of players sitting in a queue that will never resolve, because the service that was supposed to pair them is gone. A relay crash drops every connection routing through that node at once, ending live matches for everyone on it. The blast radius is what makes these services worth instrumenting carefully: one fault becomes many bad experiences, and players blame the game, not the queue.

These services are also stateful in awkward ways. A matchmaker holds the current contents of every queue in memory; a relay holds the routing tables for active sessions. When the process dies, that state is lost, so even if the service restarts cleanly, the players who were mid-formation are orphaned. Capturing what the service was holding at the moment it crashed is the only way to understand why a particular cohort of players got stranded rather than just that the process exited.

Capture the queue state at the crash

For a matchmaker crash, the stack trace is half the story; the other half is the queue. Attach the size of each queue, the composition of the match being formed when it died (party sizes, skill brackets, region constraints), and how long the affected players had been waiting. A matchmaker that crashes only while forming a five-stack across two regions is a very different bug from one that dies on a trivial solo queue, and the difference is invisible without the formation context.
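As a rough illustration of the attachment described above, here is a minimal Python sketch of a snapshot that a matchmaker could serialize into its crash report. The names `PartyInfo`, `FormationSnapshot`, and `to_context` are hypothetical, as are the field names; the sketch assumes wait time is measured against a monotonic clock value recorded at enqueue.

```python
from dataclasses import dataclass


@dataclass
class PartyInfo:
    """One party in the match being formed (hypothetical shape)."""
    size: int
    skill_bracket: str
    region: str
    enqueued_at: float  # clock reading taken when the party joined the queue


@dataclass
class FormationSnapshot:
    """What the matchmaker was holding at the moment of the crash."""
    queue_sizes: dict[str, int]
    parties: list[PartyInfo]

    def to_context(self, now: float) -> dict:
        """Flatten the snapshot into plain fields a crash report can carry."""
        return {
            "queue_sizes": dict(self.queue_sizes),
            "party_sizes": [p.size for p in self.parties],
            "skill_brackets": sorted({p.skill_bracket for p in self.parties}),
            "regions": sorted({p.region for p in self.parties}),
            # Longest wait among the affected parties; 0.0 if none were forming.
            "max_wait_seconds": max(
                (now - p.enqueued_at for p in self.parties), default=0.0
            ),
        }
```

Keeping the output as flat, plain values makes it easier to filter reports later by party size or region, whatever reporting backend ends up receiving them.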

Watch for the formation stage too. Match formation usually runs as a pipeline: gather candidates, check constraints, reserve a server, confirm players, hand off to the relay. Recording which stage was executing when the crash hit narrows the search enormously. A crash during server reservation points at your allocation backend; a crash during player confirmation points at a client that disconnected at the wrong moment. Capture the stage and you have already localized the fault.
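One way to record the executing stage is a small tracker that the pipeline updates as it moves along. The sketch below is an assumption about how such a pipeline might be structured, not a description of any particular matchmaker: `StageTracker`, `form_match`, and the `steps` mapping are hypothetical names, and the stage identifiers simply mirror the pipeline listed above.

```python
from contextlib import contextmanager

# Stage names mirror the pipeline described in the text.
STAGES = (
    "gather_candidates",
    "check_constraints",
    "reserve_server",
    "confirm_players",
    "handoff_to_relay",
)


class StageTracker:
    """Remembers which formation stage is currently executing."""

    def __init__(self):
        self.current = None

    @contextmanager
    def stage(self, name):
        if name not in STAGES:
            raise ValueError(f"unknown stage: {name}")
        self.current = name
        yield
        # Reached only if the stage body finished without raising, so a
        # failure leaves `current` pointing at the stage that died.
        self.current = None


def form_match(tracker, steps):
    """Run each stage in order; return a report dict on failure, else None.

    `steps` is a hypothetical mapping of stage name to a zero-argument callable.
    """
    try:
        for name in STAGES:
            with tracker.stage(name):
                steps[name]()
    except Exception as exc:
        return {"error": type(exc).__name__, "formation_stage": tracker.current}
    return None
```

A real service would forward the report to its crash reporter and probably re-raise; returning the dict here just keeps the sketch easy to exercise.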

Relay sessions need their own context

Relay crashes call for different attachments. Record the relay session id, the number of active connections being routed, the throughput at the time, and the protocol details that matter for your game: packet rates, channel counts, whatever you multiplex. A relay that crashes under high connection counts is hitting a scaling limit; one that crashes on a specific packet pattern has a parsing bug. The connection count at crash time usually separates the two immediately.

Tie the relay session back to the match that created it, so a dropped session connects to the matchmaking decision that placed those players on that node. This cross-reference is what lets you answer the real question ("why did these specific players get dropped?") rather than the shallow one ("why did this process exit?"). Because relays are pooled and reused, also capture which node and region handled the session, so you can spot a single bad node masquerading as a code bug.
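The cross-reference and the bad-node check can be sketched in a few lines of Python. Everything named here (`RelayCrashContext`, `crashes_by_node`, and the individual fields) is hypothetical; the only assumption is that the relay knows the match id that created the session and the node and region it is running on.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RelayCrashContext:
    """Context attached to a relay crash report (hypothetical fields)."""
    session_id: str
    match_id: str  # ties the session back to the matchmaking decision
    node: str
    region: str
    active_connections: int


def crashes_by_node(reports):
    """Count crash reports per (region, node) pair.

    A single pair dominating the counts suggests a bad node rather than
    a code bug that would affect every node equally.
    """
    return Counter((r.region, r.node) for r in reports)
```

Calling `crashes_by_node(reports).most_common(1)` on a batch of reports surfaces the node with the most crashes, which is a cheap first check before digging into stack traces.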

Separating queue stalls from real crashes

Not every stuck queue is a crash. A matchmaker can be perfectly healthy and still fail to form matches because the population is too thin, the skill spread is too wide, or a backfill request is starving. If you only capture crashes, you will miss these, and players will report being stuck with no corresponding error in your dashboard. Instrument the non-crash failure too: emit a report when a match-formation attempt exceeds a sane time budget or abandons a player.
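A minimal sketch of that non-crash instrumentation follows. The 120-second budget is an arbitrary placeholder, not a recommendation, and `check_formation`, the `kind` values, and the report fields are hypothetical names chosen for illustration.

```python
from typing import Optional

# Placeholder value; pick a budget that fits your own queue times.
FORMATION_BUDGET_SECONDS = 120.0


def check_formation(
    attempt_id: str,
    started_at: float,
    now: float,
    abandoned_players: list,
    budget: float = FORMATION_BUDGET_SECONDS,
) -> Optional[dict]:
    """Return a report for a stalled or abandoning attempt, else None."""
    elapsed = now - started_at
    if abandoned_players:
        kind = "formation_abandoned"
    elif elapsed > budget:
        kind = "formation_timeout"
    else:
        return None  # healthy attempt, nothing to report
    return {
        "kind": kind,
        "attempt_id": attempt_id,
        "elapsed_seconds": elapsed,
        "abandoned_players": list(abandoned_players),
    }
```

The returned dict would be sent through the same reporting path as a hard crash, so both kinds of failure end up side by side.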

Treating timeouts and abandonment as reportable events, alongside hard crashes, gives you one view of why players are stranded regardless of the mechanism. Grouping then does the heavy lifting: a spike in a single formation-timeout signature during peak hours is a population problem, while a spike in a crash signature right after a deploy is a code regression. Capturing both lets you tell them apart instead of conflating every frozen queue into one mystery.
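To make the grouping idea concrete, here is a small Python sketch that buckets reports by signature and hour, so a spike in one signature stands out. The signature scheme (report kind plus formation stage) and the report fields `kind`, `formation_stage`, and `timestamp` are assumptions for illustration; a real reporting tool will have its own grouping rules.

```python
from collections import Counter


def signature(report: dict) -> str:
    """Build a grouping key from the report kind and formation stage."""
    return f'{report["kind"]}:{report.get("formation_stage", "unknown")}'


def count_by_signature_and_hour(reports) -> Counter:
    """Count reports per (signature, hour bucket).

    `timestamp` is assumed to be seconds since some fixed epoch.
    """
    counts = Counter()
    for report in reports:
        hour = int(report["timestamp"] // 3600)
        counts[(signature(report), hour)] += 1
    return counts
```

Comparing the hourly counts against deploy times and peak-hour windows is then a matter of reading the buckets: a timeout signature that climbs only at peak suggests population, while a crash signature that appears right after a deploy suggests a regression.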

Setting it up with Bugnet

Bugnet gives your relay and matchmaking services a single place to report into, with the queue and session context attached automatically alongside each stack trace. Hook the SDK into the crash handlers of both services, then add custom fields for queue sizes, party composition, formation stage, relay session id, active connection count, and the node and region. The in-game report path and the server-side crash path land in the same dashboard, so a player who reports being stuck lines up with the matchmaker crash that stranded them.

Because identical failures fold into one issue with an occurrence count, a relay node that starts dropping sessions shows up as a single climbing signature rather than a flood of disconnect reports. Filter by formation stage to localize a matchmaker bug, by region to catch a bad relay node, or by party size to reproduce a stacking-specific crash. Player attributes let you see exactly which cohort got stranded, turning the vague complaint of broken matchmaking into a precise, prioritized fix.

Test the failure paths deliberately

Relay and matchmaking bugs love conditions you rarely hit in casual testing: full queues, mixed-region parties, nodes near their connection ceiling, players disconnecting mid-formation. Build load tests that recreate these deliberately and watch which crash signatures appear, so you find the scaling and timing faults before your players do at launch. The context you attach in production should be the same context your tests assert on, so a test failure reproduces a real report exactly.

Make a habit of reviewing stranded-player reports as a class, not one at a time. The patterns (a particular formation stage, a particular region, a particular party shape) will tell you where your plumbing is thinnest. Multiplayer lives or dies on whether players can reliably get into a match and stay in it, so the teams that instrument the matchmaker and relay as carefully as the gameplay are the ones whose players never see the queue at all.

A relay or matchmaker crash strands many players at once. Capture what the service was holding, not just that it died, or you will never know who got stuck.