Quick answer: Render pipeline stalls happen when the CPU and GPU block on each other through fences, resource barriers, or synchronous uploads. Capture a frame in RenderDoc or PIX, line up the CPU and GPU timelines, and look for gaps. The fix is usually one of: parallel command list recording, moving uploads onto a copy queue, or batching draw submissions.

A render pipeline stall is one of the hardest performance bugs to catch because the symptom, a frame time spike, can mean a dozen different things. The GPU might be idle waiting for the CPU to finish recording commands. The CPU might be blocked on a fence waiting for the GPU to finish a prior frame. A copy queue might be serialized behind the graphics queue because of an overly conservative fence wait. None of this shows up in a simple CPU profiler. You need to look at both pipelines side by side, and you need the right tools to do it.

Capture a Frame That Actually Contains the Stall

The first mistake most people make is capturing the wrong frame. A stall that happens once every ten seconds is unlikely to show up in a random capture. Add a hotkey or telemetry hook that triggers a capture when your frame time exceeds a threshold. RenderDoc supports programmatic captures through its in-application API, and PIX supports programmatic timing captures, which include CPU/GPU correlation, through PIXBeginCapture with the PIX_CAPTURE_TIMING flag.

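// RenderDoc programmatic capture when a frame exceeds its budget.
// rdoc_api is the in-application API pointer obtained from RENDERDOC_GetAPI.
// TriggerCapture() captures the next frame presented, not the slow frame that
// just finished, so this works best for stalls that persist or repeat.
if (frameTimeMs > 33.0f && rdoc_api) {
    rdoc_api->TriggerCapture();
    LogWarning("render.stall",
        "frame_ms", frameTimeMs,
        "scene", currentScene);
}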
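
A bare threshold check fires on every slow frame, which can bury the interesting capture under dozens of near-duplicates during a sustained slowdown. The sketch below adds a cooldown so that at most one capture is requested per window. The class name, the cooldown parameter, and the idea of passing the clock in from the caller are all assumptions made for illustration; the helper does not call RenderDoc or PIX itself.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: decides when a slow frame should request a capture.
// The caller triggers the actual capture when shouldCapture() returns true.
class StallCaptureTrigger {
public:
    using Clock = std::chrono::steady_clock;

    StallCaptureTrigger(float budgetMs, std::chrono::milliseconds cooldown)
        : budgetMs_(budgetMs), cooldown_(cooldown) {
        assert(budgetMs > 0.0f);
    }

    // Returns true only for frames over budget, and at most once per
    // cooldown window.
    bool shouldCapture(float frameTimeMs, Clock::time_point now) {
        if (frameTimeMs <= budgetMs_) return false;
        if (hasFired_ && now - lastFire_ < cooldown_) return false;
        hasFired_ = true;
        lastFire_ = now;
        return true;
    }

private:
    float budgetMs_;
    std::chrono::milliseconds cooldown_;
    Clock::time_point lastFire_{};
    bool hasFired_ = false;
};
```

In a frame loop this would wrap the check above: call shouldCapture(frameTimeMs, StallCaptureTrigger::Clock::now()) and only trigger the capture when it returns true.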

Timing captures are more useful than single-frame captures for stall debugging. A single frame shows you the draw calls but not the relationship between submission and execution. A timing capture covers several frames and shows CPU submission events aligned with GPU execution events, which is what you actually need.

Read the PIX Timeline Like a Detective

Open the timing capture and arrange the view so the CPU submission thread is at the top and each GPU queue is below it. You are looking for two patterns. The first is CPU idle with GPU busy: the CPU submitted a frame, finished its next frame of simulation, and is now blocked on Present because the swap chain is full. This means you are GPU-bound and the fix is to reduce GPU work.

The second pattern is GPU idle with CPU busy: the GPU finished its frame and now sits idle while the CPU records the next command list. This is a CPU submission bottleneck and the fix is usually to record command lists in parallel or to batch draws more aggressively. If you see both patterns alternating, you have a deeper synchronization problem, often caused by an explicit fence wait mid-frame.
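
If you export per-frame wait times from a capture, the two patterns can be told apart mechanically. The following is a rough sketch under the assumption that you already have, for each frame, how long the CPU was blocked and how long the GPU sat idle; the struct, the enum, and the threshold are hypothetical and do not correspond to any PIX or RenderDoc export format.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-frame numbers read off a timing capture, in milliseconds.
struct FrameWaits {
    double cpuBlockedMs;  // CPU time spent blocked in Present or on a fence
    double gpuIdleMs;     // GPU time spent idle with no work queued
};

enum class Bottleneck { None, GpuBound, CpuBound, Alternating };

// Classifies a multi-frame capture. A wait only counts if it exceeds
// thresholdMs, so small scheduling gaps are ignored.
Bottleneck classifyCapture(const std::vector<FrameWaits>& frames,
                           double thresholdMs) {
    assert(thresholdMs >= 0.0);
    bool cpuWaited = false;  // CPU blocked while the GPU was busy
    bool gpuWaited = false;  // GPU starved while the CPU was busy
    for (const FrameWaits& f : frames) {
        if (f.cpuBlockedMs > thresholdMs) cpuWaited = true;
        if (f.gpuIdleMs > thresholdMs) gpuWaited = true;
    }
    if (cpuWaited && gpuWaited) return Bottleneck::Alternating;
    if (cpuWaited) return Bottleneck::GpuBound;
    if (gpuWaited) return Bottleneck::CpuBound;
    return Bottleneck::None;
}
```

An Alternating result is the cue to go looking for an explicit fence wait in the middle of the frame rather than tuning either side in isolation.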

Move Command Recording Off the Main Thread

Modern graphics APIs (D3D12, Vulkan, Metal) support parallel command list recording, but many engines still record on the main thread. If your CPU submission takes 6 ms of a 16.6 ms frame budget, moving recording to worker threads can cut that time roughly in half. Split your scene into logical buckets (opaque, transparent, shadow, UI) and record each on a separate thread, then submit them in the correct order on the main thread.

// Parallel command list recording with a worker pool
std::vector<CommandList*> lists(bucketCount);
parallel_for(0, bucketCount, [&](int i) {
    auto* cmd = AcquireCommandList();
    RecordBucket(cmd, scene.buckets[i]);
    cmd->Close();
    lists[i] = cmd;
});

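// Submit in bucket order from a single thread. In D3D12,
// ExecuteCommandLists is a method on the command queue, not the device.
commandQueue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());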

Watch for synchronization costs. If your scene data is mutated while recording threads read it, you will get crashes or corruption. Snapshot the data at the start of recording or use a double-buffered transform hierarchy so the simulation can proceed without blocking the render threads.
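
One possible shape for the double-buffered approach is sketched here. The simulation writes into one copy of the transforms while recording threads read the other, and the two are swapped at a frame boundary. The type and method names are hypothetical, and the sketch assumes publish() is only called once every recording thread has finished reading for the frame.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Transform { float x, y, z; };

// Hypothetical double buffer for scene transforms.
class TransformDoubleBuffer {
public:
    explicit TransformDoubleBuffer(std::size_t count) {
        assert(count > 0);
        buffers_[0].resize(count);  // value-initialized to zero
        buffers_[1].resize(count);
    }

    // The simulation mutates this copy.
    std::vector<Transform>& writable() { return buffers_[writeIndex_]; }

    // Recording threads read this copy; it does not change until publish().
    const std::vector<Transform>& readable() const {
        return buffers_[1 - writeIndex_];
    }

    // Call at a frame boundary, with no readers active. The buffer that was
    // just written becomes readable, and the new writable buffer is seeded
    // with the same state so the next simulation step starts from it.
    void publish() {
        const std::size_t published = writeIndex_;
        writeIndex_ = 1 - writeIndex_;
        buffers_[writeIndex_] = buffers_[published];
    }

private:
    std::array<std::vector<Transform>, 2> buffers_;
    std::size_t writeIndex_ = 0;
};
```

The copy in publish() costs one pass over the transforms per frame. That is usually cheap next to the cost of a lock that every recording thread would otherwise contend on, but it is worth measuring on large scenes.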

Batch Draw Calls for CPU Submission Time

Draw call batching is often framed as a GPU optimization, but on modern hardware it is mostly a CPU optimization. Each DrawIndexed costs driver validation, constant buffer binding, and descriptor table updates. Reducing the number of calls from 5,000 to 500 can save 3–5 ms of CPU submission time on low-end hardware, even if the GPU does the exact same work.

Use indirect draws and bindless resources where supported. ExecuteIndirect in D3D12 and vkCmdDrawIndirect in Vulkan let the GPU read the draw parameters from a buffer, which means you submit one command that produces many draws. Combined with a bindless texture heap, you can render an entire scene with a handful of CPU-side calls.
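
To make the batching step concrete, here is a CPU-side sketch that collapses a list of per-object draws into a short array of indirect arguments, merging objects that share a mesh into one instanced draw. DrawIndexedArgs mirrors the field order of D3D12_DRAW_INDEXED_ARGUMENTS (VkDrawIndexedIndirectCommand has the same layout); DrawItem and buildIndirectArgs are hypothetical names, and uploading the result to a GPU buffer is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order and sizes as D3D12_DRAW_INDEXED_ARGUMENTS.
struct DrawIndexedArgs {
    std::uint32_t indexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startIndexLocation;
    std::int32_t  baseVertexLocation;
    std::uint32_t startInstanceLocation;
};
static_assert(sizeof(DrawIndexedArgs) == 20, "must match the GPU layout");

// Hypothetical per-object draw request produced by scene traversal.
struct DrawItem {
    std::uint32_t meshId;
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// Groups items by mesh and emits one instanced draw per distinct mesh.
std::vector<DrawIndexedArgs> buildIndirectArgs(std::vector<DrawItem> items) {
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) { return a.meshId < b.meshId; });

    std::vector<DrawIndexedArgs> args;
    std::uint32_t instanceOffset = 0;
    for (std::size_t i = 0; i < items.size();) {
        std::size_t j = i;
        while (j < items.size() && items[j].meshId == items[i].meshId) ++j;
        const std::uint32_t count = static_cast<std::uint32_t>(j - i);
        args.push_back({items[i].indexCount, count, items[i].firstIndex,
                        items[i].baseVertex, instanceOffset});
        instanceOffset += count;
        i = j;
    }
    return args;
}
```

Per-instance data such as world matrices has to be written in the same sorted order, so that startInstanceLocation indexes the right entries.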

Isolate Texture Streaming on a Copy Queue

Texture streaming is a common source of stalls that do not show up in simple profiling. When the renderer samples a texture whose target mip has not finished uploading, something has to give: either the graphics queue stalls waiting for the copy, or the renderer falls back to a lower mip that is already resident (causing visual pops). Neither is ideal. In explicit APIs such as D3D12 and Vulkan the driver will not make that choice for you, so the synchronization has to live in your code. Use a dedicated copy queue for streaming uploads and use a fence to gate the first frame where the new mip is sampled.

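// Record the upload on a copy-type command list; CopyTextureRegion is a
// command list method, not a queue method. dstMip and srcFootprint are
// D3D12_TEXTURE_COPY_LOCATIONs describing streamingTex and stagingBuffer.
copyCmdList->CopyTextureRegion(&dstMip, 0, 0, 0, &srcFootprint, nullptr);
copyCmdList->Close();
ID3D12CommandList* copyLists[] = { copyCmdList };
copyQueue->ExecuteCommandLists(1, copyLists);
copyQueue->Signal(uploadFence, ++uploadFenceValue);

// GPU-side wait: the graphics queue holds until the fence is reached,
// while the CPU keeps running
graphicsQueue->Wait(uploadFence, uploadFenceValue);
graphicsQueue->ExecuteCommandLists(...);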
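
A queue-side Wait still holds the graphics queue if the copy is late. One alternative, sketched below as an assumption rather than a description of any particular engine, is to track which fence value each mip was uploaded under and sample only the most detailed mip that has already completed. In D3D12 the completed value would come from ID3D12Fence::GetCompletedValue; the struct and function here are hypothetical bookkeeping around that number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping for one streamed texture. readyFenceValue[m] is
// the copy-queue fence value signaled after mip m finished uploading.
// Mip 0 is the most detailed level.
struct StreamedTexture {
    std::vector<std::uint64_t> readyFenceValue;
};

// Returns the most detailed mip whose upload has completed, given the value
// the upload fence has reached so far. Returns -1 if no mip is resident yet.
int finestResidentMip(const StreamedTexture& tex,
                      std::uint64_t completedFenceValue) {
    for (std::size_t mip = 0; mip < tex.readyFenceValue.size(); ++mip) {
        if (tex.readyFenceValue[mip] <= completedFenceValue) {
            return static_cast<int>(mip);
        }
    }
    return -1;
}
```

The returned index can then be used to clamp the minimum LOD for the draw, which trades a possible visual pop for never blocking the graphics queue on the copy.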

Pre-warm streaming for known-important textures. When the player enters a new region, kick off uploads for textures in that region one second before they are visible. A second of lead time is usually enough to hide the upload behind gameplay that does not yet need the new assets.
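
The lead time translates into a trigger distance that depends on how fast the player is approaching. A minimal sketch, assuming you can measure the distance to the region boundary and the approach speed (the function name and parameters are hypothetical):

```cpp
#include <cassert>

// Returns true once the player is within leadTimeSeconds of reaching the
// region boundary at the current approach speed.
bool shouldPrewarm(float distanceMeters, float approachSpeedMps,
                   float leadTimeSeconds) {
    assert(leadTimeSeconds >= 0.0f);
    if (distanceMeters <= 0.0f) return true;      // already inside the region
    if (approachSpeedMps <= 0.0f) return false;   // stationary or moving away
    return distanceMeters / approachSpeedMps <= leadTimeSeconds;
}
```

With a one-second lead, a player moving at 10 m/s starts the uploads 10 m from the boundary, while a player on a fast vehicle starts them proportionally earlier.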

“We chased a 40 ms spike in our shipped build for two weeks. It turned out to be a resource barrier on a shadow atlas that flushed the entire pipeline every time the sun rotated through a new cascade boundary. RenderDoc showed the stall immediately — we just had never captured the right frame.”

Related Issues

For related GPU work, see how to debug streaming hitches in open-world games. For CPU-side analysis, read how to debug garbage collection spikes in your game.

Next time you see a frame spike, capture it with CPU/GPU correlation enabled. The gap between the timelines will tell you which side is waiting.