Quick answer: Measure CPU frame time and GPU frame time separately. The larger number is your bottleneck. Ship this as telemetry with device info so you can see per-hardware patterns. A player with a 4090 whose frames are CPU-bound is a very different problem from a player with a 1060 whose frames are GPU-bound.
“It’s slow” isn’t a bug report you can act on. “Frames are averaging 38 ms, of which 32 ms is CPU work on the render thread, on a Steam Deck” is. The distance between those two reports is the work of instrumenting your frame. Without it, you’ll spend weeks optimizing the GPU path for players whose bottleneck is actually your scripting layer.
Three Times Per Frame
There are three numbers worth tracking each frame:
- CPU frame time — wall-clock time from the start of your main loop to the point where you finish submitting all GPU commands for this frame.
- GPU frame time — the GPU’s own reported duration, captured via timestamp queries around the frame’s commands.
- GPU idle time — the span where the GPU finished its queue and sat waiting for the CPU to submit more work.
From those three, the bottleneck is trivial to classify:
- CPU > GPU and GPU idle > 0: CPU-bound. The GPU is waiting for work.
- GPU > CPU: GPU-bound. The CPU is waiting for the GPU to finish.
- CPU ≈ GPU ≈ VSync: balanced and VSync-locked. No bottleneck.
- CPU ≈ GPU but both high: both are near capacity; optimize either side.
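The rules above can be sketched as a small function. This is a minimal illustration, not a definitive classifier: the name classify_frame, the tolerance_ms and vsync_ms parameters, the label strings, and the extra "unclassified" fallback are all assumptions introduced here.

```python
def classify_frame(cpu_ms, gpu_ms, gpu_idle_ms, vsync_ms=16.67, tolerance_ms=1.0):
    """Label one frame from its CPU time, GPU time, and GPU idle time.

    tolerance_ms decides what counts as "approximately equal"; both it and
    vsync_ms are illustrative defaults, not values from any engine.
    """
    if abs(cpu_ms - gpu_ms) <= tolerance_ms:
        # CPU and GPU are roughly equal: locked to VSync, or both saturated.
        near_vsync = (abs(cpu_ms - vsync_ms) <= tolerance_ms
                      and abs(gpu_ms - vsync_ms) <= tolerance_ms)
        if near_vsync:
            return "vsync-locked"
        if cpu_ms > vsync_ms:
            return "both-high"
        return "unclassified"
    if cpu_ms > gpu_ms and gpu_idle_ms > 0:
        return "cpu-bound"   # the GPU ran dry and waited for work
    if gpu_ms > cpu_ms:
        return "gpu-bound"   # the CPU waited for the GPU to finish
    return "unclassified"    # CPU larger but no idle time was recorded
```

For the Steam Deck report from the introduction, classify_frame(32.0, 12.0, 18.0) would return "cpu-bound", assuming the GPU sat idle for part of the frame.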
Measuring CPU Time
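Capture high-resolution timer values (for example, Stopwatch in C#) at the start and end of your frame. In Unity, Application.onBeforeRender and Camera.onPostRender (or a submit-complete callback) bracket the rendering portion of the frame; because onBeforeRender fires after scripts have already run, start the timer earlier, such as in an early-ordered Update, if you want gameplay code included. In Unreal, use FSlateApplication::OnPreTick and FSlateApplication::OnPostTick. In a custom engine, bracket your main loop manually.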
Split CPU time into main-thread and render-thread buckets. The main thread runs scripts, physics, and animation; the render thread submits draw calls. Knowing which thread is the bottleneck changes what you optimize. A high render-thread number points at too many draw calls; a high main-thread number points at gameplay code.
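A minimal sketch of bucketed CPU timing for a custom loop follows. The class name CpuFrameTimer and the bucket names are hypothetical. In a real engine the main and render threads run concurrently, so each thread would own its own timer; here both buckets are measured on one thread only to keep the example short.

```python
import time
from contextlib import contextmanager


class CpuFrameTimer:
    """Accumulates wall-clock time per named bucket for a single frame."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock      # injectable so the timer can be unit tested
        self.buckets_ns = {}

    def begin_frame(self):
        # Reset all buckets at the top of the main loop.
        self.buckets_ns = {}

    @contextmanager
    def measure(self, bucket):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.buckets_ns[bucket] = self.buckets_ns.get(bucket, 0) + elapsed

    def ms(self, bucket):
        # Convert the accumulated nanoseconds to milliseconds.
        return self.buckets_ns.get(bucket, 0) / 1_000_000
```

Wrapping script, physics, and animation updates in `with timer.measure("main_thread"):` and draw-call submission in `with timer.measure("render_thread"):` yields the two numbers discussed above.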
Measuring GPU Time
GPU time requires timestamp queries. Every graphics API provides them: ID3D12GraphicsCommandList::EndQuery in Direct3D 12, vkCmdWriteTimestamp in Vulkan, MTLCommandBuffer.GPUStartTime and GPUEndTime in Metal. The flow is:
- Write a timestamp at the start of the frame’s commands.
- Write a timestamp at the end.
- Once the GPU has finished executing the frame, typically one or two frames later, read both values and compute the delta.
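// Pseudo-code for per-frame GPU timing (D3D12-style API)
// Reuse a small ring of query slots so indices never outgrow the heap.
var slot = frameIndex % FRAMES_IN_FLIGHT;   // hypothetical constant, e.g. 3
var beginIdx = slot * 2;
var endIdx = beginIdx + 1;
commandList.EndQuery(queryHeap, TIMESTAMP, beginIdx);
// ... frame's rendering commands ...
commandList.EndQuery(queryHeap, TIMESTAMP, endIdx);
// Copy both 8-byte timestamps into a CPU-readable buffer.
commandList.ResolveQueryData(queryHeap, TIMESTAMP, beginIdx, 2, readbackBuffer, beginIdx * 8);
// Once the GPU has finished this frame (a frame or two later), read the results:
var gpuStart = timestamps[beginIdx];
var gpuEnd = timestamps[endIdx];
// gpuFrequency is in ticks per second (ID3D12CommandQueue::GetTimestampFrequency).
var gpuMs = (gpuEnd - gpuStart) * 1000.0 / gpuFrequency;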
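The final conversion from ticks to milliseconds differs by API, which is easy to get wrong when porting. Direct3D 12 reports a frequency in ticks per second (ID3D12CommandQueue::GetTimestampFrequency), while Vulkan reports nanoseconds per tick (VkPhysicalDeviceLimits::timestampPeriod) and may use fewer than 64 valid bits (timestampValidBits). The helper names below are hypothetical, and the wrap-around handling is a sketch that assumes at most one wrap between the two timestamps.

```python
def d3d12_ticks_to_ms(begin_ticks, end_ticks, ticks_per_second):
    """D3D12 style: the command queue reports ticks per second."""
    return (end_ticks - begin_ticks) * 1000.0 / ticks_per_second


def vulkan_ticks_to_ms(begin_ticks, end_ticks, timestamp_period_ns, valid_bits=64):
    """Vulkan style: the device reports nanoseconds per tick."""
    mask = (1 << valid_bits) - 1
    # Masking the difference tolerates a single counter wrap-around.
    delta = (end_ticks - begin_ticks) & mask
    return delta * timestamp_period_ns / 1_000_000.0
```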
Queued Frames and Latency
Present-to-display latency depends on how many frames the driver keeps queued. The default is 1–3. A deep queue absorbs CPU stalls but increases input lag. Competitive games usually force the queue to 1. Single-player games benefit from 2–3 for smoother frame pacing.
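Measure the queue depth by comparing the timestamp of the frame’s present call to the time the image actually appears. On Windows, IDXGISwapChain::GetFrameStatistics reports the present count and the QueryPerformanceCounter time of the most recent vblank, while the waitable object from IDXGISwapChain2::GetFrameLatencyWaitableObject lets you throttle the CPU to a chosen queue depth; on consoles, the platform flip API gives you scan-out time. A queue depth of 3 with 16.67 ms frames means roughly 50 ms pass between a player’s click and the first displayed frame that reflects it.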
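The arithmetic is simple enough to sketch in both directions. The function names are hypothetical, and the model assumes steady frame pacing, which real frames rarely have.

```python
def queue_latency_ms(queue_depth, frame_ms):
    """Delay added by the frames already queued ahead of a new one."""
    return queue_depth * frame_ms


def estimate_queue_depth(present_to_display_ms, frame_ms):
    """Invert the relation: how many frame intervals the measured delay spans."""
    return max(1, round(present_to_display_ms / frame_ms))
```

With these, a measured present-to-display delay of 50 ms at 16.67 ms per frame corresponds to an estimated depth of 3.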
Ship It as Telemetry
Don’t just log locally. Emit a compact telemetry record once per minute or once per level load, with the following fields:
- Average, P50, P95, P99 CPU ms.
- Average, P50, P95, P99 GPU ms.
- Percentage of frames classified as CPU-bound, GPU-bound, VSync-locked.
- GPU model, driver version, CPU model, RAM, resolution, quality preset.
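One way to assemble such a record is sketched below. The field names, the nearest-rank percentile method, and the build_record helper are assumptions for illustration; any consistent percentile definition works as long as every client uses the same one.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def summarize(samples_ms):
    """Average plus P50/P95/P99 for one series of frame times."""
    ordered = sorted(samples_ms)
    return {
        "avg": round(statistics.fmean(ordered), 2),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


def build_record(cpu_ms, gpu_ms, labels, device):
    """Combine timing summaries, bound-state shares, and device info."""
    total = len(labels)
    bound_pct = {
        name: round(100.0 * labels.count(name) / total, 1)
        for name in ("cpu-bound", "gpu-bound", "vsync-locked")
    }
    return {
        "cpu_ms": summarize(cpu_ms),
        "gpu_ms": summarize(gpu_ms),
        "bound_pct": bound_pct,
        "device": device,  # GPU model, driver, CPU model, RAM, resolution, preset
    }
```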
Group the results by GPU model and quality preset. Patterns will emerge: for example, Intel HD Graphics may turn out to be GPU-bound at nearly every preset, while an M1 Max may be CPU-bound on your main thread because its GPU finishes so quickly that it sits idle waiting for work. Each cluster is a distinct optimization target.
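On the analysis side, the grouping can be as small as the sketch below. The flat record layout with gpu_model, quality_preset, and cpu_bound_pct keys is a hypothetical schema chosen for the example.

```python
from collections import defaultdict


def group_cpu_bound_share(records):
    """Average the CPU-bound percentage per (gpu_model, quality_preset)."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["gpu_model"], rec["quality_preset"])
        groups[key].append(rec["cpu_bound_pct"])
    # One mean per cluster; each cluster is a separate optimization target.
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}
```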
Don’t Trust a Single Frame
Any single frame can spike due to GC, a shader compile, or a streaming hitch. Classify “bound” state only over a rolling window of 60 or more frames. The on-screen HUD should show the classification averaged over 1 second; the telemetry should log percentiles over a longer window.
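A rolling window is straightforward to sketch with a bounded deque. The class name, the default window of 60 frames, and the 1 ms tolerance are assumptions; for brevity this version compares averaged CPU and GPU time only and ignores GPU idle time.

```python
from collections import deque


class RollingBoundClassifier:
    """Classifies bound state from averages over the last `window` frames."""

    def __init__(self, window=60, tolerance_ms=1.0):
        self._cpu = deque(maxlen=window)   # oldest samples fall off automatically
        self._gpu = deque(maxlen=window)
        self._tolerance_ms = tolerance_ms

    def add_frame(self, cpu_ms, gpu_ms):
        self._cpu.append(cpu_ms)
        self._gpu.append(gpu_ms)

    def classify(self):
        if len(self._cpu) < self._cpu.maxlen:
            return None  # not enough frames collected yet
        cpu_avg = sum(self._cpu) / len(self._cpu)
        gpu_avg = sum(self._gpu) / len(self._gpu)
        if abs(cpu_avg - gpu_avg) <= self._tolerance_ms:
            return "balanced"
        return "cpu-bound" if cpu_avg > gpu_avg else "gpu-bound"
```

Because the decision uses window averages, one frame with a GPU spike does not flip a steadily CPU-bound session to GPU-bound.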
Acting on the Data
Once you know where frames are bound, the optimizations are well-known:
- CPU main thread bound: profile scripts, cache component lookups, reduce Update() calls, move work to jobs.
- CPU render thread bound: reduce draw calls (GPU instancing, SRP Batcher, static batching), simplify material variants, cull earlier.
- GPU bound: reduce overdraw, simplify pixel shaders, lower shadow resolution, reduce post-processing.
A player who is CPU-bound will see no improvement from lowering shadow quality. A player who is GPU-bound will see no improvement from disabling physics. Targeting the right advice to the right player is the payoff for measuring this accurately.
“Frame rate is a symptom. CPU time, GPU time, and the gap between them are the diagnosis.”
Related Issues
For deeper profiling workflows, see how to profile Unreal shipping builds. For turning this data into alerts, see performance regression detection for games.
Optimize the bottleneck. Optimizing anything else is rearranging deck chairs on a frame-rate-bound ship.