Quick answer: Measure CPU frame time and GPU frame time separately. The larger number is your bottleneck. Ship this as telemetry with device info so you can see per-hardware patterns. A player with a 4090 whose frames are CPU-bound is a very different problem from a player with a 1060 whose frames are GPU-bound.

“It’s slow” isn’t a bug report you can act on. “Frames are averaging 38 ms, of which 32 ms is CPU work on the render thread, on a Steam Deck” is. The distance between those two reports is the work of instrumenting your frame. Without it, you’ll spend weeks optimizing the GPU path for players whose bottleneck is actually your scripting layer.

Three Times Per Frame

There are three numbers worth tracking each frame:

  1. CPU frame time — wall-clock time from the start of your main loop to the point where you finish submitting all GPU commands for this frame.
  2. GPU frame time — the GPU’s own reported duration, captured via timestamp queries around the frame’s commands.
  3. GPU idle time — the span where the GPU finished its queue and sat waiting for the CPU to submit more work.

From those three, the bottleneck is trivial to classify:
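A minimal sketch of that classification, assuming all three numbers are in milliseconds and that you compare them against a target frame budget. The `Bound` enum, the `classifyFrame` name, and the tie-break rule are illustrative assumptions, not part of any engine API.

```cpp
// Hypothetical labels for the three outcomes.
enum class Bound { Cpu, Gpu, Neither };

// cpuMs, gpuMs, gpuIdleMs: the three per-frame numbers listed above.
// budgetMs: target frame time, e.g. 16.67 for a 60 Hz target.
Bound classifyFrame(double cpuMs, double gpuMs, double gpuIdleMs, double budgetMs) {
    // Both sides fit inside the budget: the frame is limited by vsync
    // or a frame cap, not by either processor.
    if (cpuMs <= budgetMs && gpuMs <= budgetMs) return Bound::Neither;

    // Otherwise the larger number is the bottleneck.
    if (cpuMs > gpuMs) return Bound::Cpu;
    if (gpuMs > cpuMs) return Bound::Gpu;

    // Exact tie (assumed rule): a GPU that sat idle was waiting on the CPU.
    return gpuIdleMs > 0.0 ? Bound::Cpu : Bound::Gpu;
}
```

For the Steam Deck report from the introduction, something like `classifyFrame(32.0, 12.0, 20.0, 16.67)` would come back as `Bound::Cpu`.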

Measuring CPU Time

Capture Stopwatch values at the start and end of your frame. In Unity, hook into Application.onBeforeRender for the start and Camera.onPostRender (or a submit-complete callback) for the end; note that onBeforeRender fires after scripts have already updated, so start the timer earlier in the frame if you want script time included. In Unreal, use FSlateApplication::OnPreTick and FSlateApplication::OnPostTick. In a custom engine, bracket your main loop manually.

Split CPU time into main-thread and render-thread buckets. The main thread runs scripts, physics, and animation; the render thread submits draw calls. Knowing which thread is the bottleneck changes what you optimize. A high render-thread number usually points at too many draw calls; a high main-thread number usually points at gameplay code.

Measuring GPU Time

GPU time requires GPU-side timestamps. The major graphics APIs expose them: ID3D12GraphicsCommandList::EndQuery with a timestamp query in Direct3D 12, vkCmdWriteTimestamp in Vulkan, and in Metal the GPUStartTime and GPUEndTime properties of MTLCommandBuffer, which are read after the command buffer completes. The flow is:

  1. Write a timestamp at the start of the frame’s commands.
  2. Write a timestamp at the end.
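  3. Once the GPU has finished the frame (typically one to three frames later, depending on how many frames are in flight), read both values and compute the delta.

// Pseudo-code for per-frame GPU timing (Direct3D 12 style)
// Reuse query slots in a ring so indices never run past the end of the heap.
var slot = frameIndex % framesInFlight;
var beginIdx = slot * 2;
var endIdx = beginIdx + 1;

commandList.EndQuery(queryHeap, TIMESTAMP, beginIdx);
// ... frame's rendering commands ...
commandList.EndQuery(queryHeap, TIMESTAMP, endIdx);
// Copy both 8-byte results into a CPU-readable buffer.
commandList.ResolveQueryData(queryHeap, TIMESTAMP, beginIdx, 2, readbackBuffer, beginIdx * 8);

// Later, after the fence for this frame has signaled, the results are ready:
var gpuStart = timestamps[beginIdx];
var gpuEnd = timestamps[endIdx];
// gpuFrequency is in ticks per second (commandQueue.GetTimestampFrequency).
var gpuMs = (gpuEnd - gpuStart) * 1000.0 / gpuFrequency;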

Queued Frames and Latency

Present-to-display latency depends on how many frames the driver keeps queued. Defaults vary by API and driver, typically between 1 and 3. A deep queue absorbs CPU stalls but increases input lag. Competitive games usually force the queue to 1. Single-player games often benefit from 2–3 for smoother frame pacing.

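Measure the added latency by comparing the time of the frame’s present call to the time the image actually reaches the display. On Windows, IDXGISwapChain::GetFrameStatistics reports the present count and the QPC time of the most recent vblank, while IDXGISwapChain2::GetFrameLatencyWaitableObject lets you throttle the CPU so the queue stays shallow; on consoles, the platform flip API gives you scan-out time. A queue depth of 3 with 16.67 ms frames means roughly 50 ms can pass between a player’s click and the first pixel that reflects it.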

Ship It as Telemetry

Don’t just log locally. Emit a compact telemetry record once per minute or once per level load, with the following fields:
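
One possible shape for such a record is sketched here. Every field name is an assumption, chosen to match the measurements discussed in this article; the serializer is deliberately naive and assumes the string fields contain no characters that need JSON escaping.

```cpp
#include <sstream>
#include <string>

// Hypothetical telemetry record; adapt the fields to your own pipeline.
struct FrameTelemetry {
    std::string gpuModel;       // device info used for grouping
    std::string qualityPreset;  // e.g. "low", "medium", "high"
    double cpuMainMs;           // mean main-thread time over the window
    double cpuRenderMs;         // mean render-thread time over the window
    double gpuMs;               // mean GPU frame time over the window
    double gpuIdleMs;           // mean GPU idle time over the window
    double frameP50Ms;          // frame-time percentiles over the window
    double frameP95Ms;
    double frameP99Ms;
    std::string bound;          // "cpu", "gpu", or "neither"
};

// Serializes the record as a single compact JSON object.
std::string toJson(const FrameTelemetry& t) {
    std::ostringstream out;
    out << "{"
        << "\"gpu_model\":\"" << t.gpuModel << "\","
        << "\"quality_preset\":\"" << t.qualityPreset << "\","
        << "\"cpu_main_ms\":" << t.cpuMainMs << ","
        << "\"cpu_render_ms\":" << t.cpuRenderMs << ","
        << "\"gpu_ms\":" << t.gpuMs << ","
        << "\"gpu_idle_ms\":" << t.gpuIdleMs << ","
        << "\"frame_p50_ms\":" << t.frameP50Ms << ","
        << "\"frame_p95_ms\":" << t.frameP95Ms << ","
        << "\"frame_p99_ms\":" << t.frameP99Ms << ","
        << "\"bound\":\"" << t.bound << "\""
        << "}";
    return out.str();
}
```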

Group the results by GPU model and quality preset. You’ll see patterns: integrated GPUs such as Intel HD Graphics tend to be GPU-bound, while an M1 Max may be CPU-bound on your main thread because its GPU finishes so quickly that it sits idle waiting for the CPU. Each cluster is a distinct optimization target.

Don’t Trust a Single Frame

Any single frame can spike due to GC, a shader compile, or a streaming hitch. Classify “bound” state only over a rolling window of 60 or more frames. The on-screen HUD should show the classification averaged over 1 second; the telemetry should log percentiles over a longer window.
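
A rolling window like this can be sketched with a bounded queue. The `RollingFrameStats` class below is a hypothetical helper; it keeps the last N frame times and reports a mean for the HUD and nearest-rank percentiles for telemetry. It sorts a copy on every percentile query, which is fine for windows of a few hundred samples but is an assumption to revisit for much larger ones.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

class RollingFrameStats {
public:
    explicit RollingFrameStats(std::size_t capacity) : capacity_(capacity) {}

    // Adds one frame time and drops the oldest sample once the window is full.
    void push(double frameMs) {
        samples_.push_back(frameMs);
        if (samples_.size() > capacity_) samples_.pop_front();
    }

    // True once the window holds enough frames to classify from.
    bool full() const { return samples_.size() == capacity_; }

    double mean() const {
        if (samples_.empty()) return 0.0;
        const double sum = std::accumulate(samples_.begin(), samples_.end(), 0.0);
        return sum / static_cast<double>(samples_.size());
    }

    // Nearest-rank percentile; p is a fraction in [0, 1], e.g. 0.99 for p99.
    double percentile(double p) const {
        if (samples_.empty()) return 0.0;
        std::vector<double> sorted(samples_.begin(), samples_.end());
        std::sort(sorted.begin(), sorted.end());
        const double rank = std::ceil(p * static_cast<double>(sorted.size()));
        const std::size_t idx = rank < 1.0 ? 0 : static_cast<std::size_t>(rank) - 1;
        return sorted[std::min(idx, sorted.size() - 1)];
    }

private:
    std::size_t capacity_;
    std::deque<double> samples_;
};
```

In use, you would push each frame's time, wait until `full()` returns true, and only then classify from `mean()` or log `percentile(0.95)` and `percentile(0.99)`.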

Acting on the Data

Once you know where frames are bound, the optimizations are well-known:

A player who is CPU-bound will see no improvement from lowering shadow quality. A player who is GPU-bound will see no improvement from disabling physics. Targeting the right advice to the right player is the payoff for measuring this accurately.
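
The pairing of bottleneck and advice can be sketched as a simple lookup. The enum and function names are hypothetical, and the suggestions are limited to the examples already mentioned in this article rather than a complete optimization guide.

```cpp
#include <string>

// Hypothetical bottleneck categories, matching the buckets measured earlier.
enum class Bottleneck { MainThread, RenderThread, Gpu };

// Returns a first suggestion for the given bottleneck.
std::string firstThingToTry(Bottleneck b) {
    switch (b) {
        case Bottleneck::MainThread:
            return "Profile gameplay code: scripts, physics, animation";
        case Bottleneck::RenderThread:
            return "Reduce draw calls";
        case Bottleneck::Gpu:
            return "Lower GPU-heavy settings such as shadow quality";
    }
    return "Unknown bottleneck";
}
```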

“Frame rate is a symptom. CPU time, GPU time, and the gap between them are the diagnosis.”

Related Issues

For deeper profiling workflows, see how to profile Unreal shipping builds. For turning this data into alerts, see performance regression detection for games.

Optimize the bottleneck. Optimizing anything else is rearranging deck chairs on a frame-rate-bound ship.