Quick answer: The GPU runs asynchronously. After recording the dispatch on a local RenderingDevice, call rd.submit() and then rd.sync() before buffer_get_data; otherwise you read the buffer's pre-dispatch contents.
A custom compute shader does a parallel reduction. Reading the result buffer returns the input values — the read happened before the GPU finished.
Submit, Then Sync
# Record the compute commands.
var compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(compute_list, pipeline)
rd.compute_list_bind_uniform_set(compute_list, uniform_set, 0)
rd.compute_list_dispatch(compute_list, groups_x, 1, 1)
rd.compute_list_end()
rd.submit() # send the recorded work to the GPU
rd.sync() # block until the GPU is done
var output = rd.buffer_get_data(output_buffer) # raw bytes (PackedByteArray)
submit() sends the work; sync() blocks the CPU until the GPU finishes. Only then is buffer_get_data valid.
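buffer_get_data returns the buffer's raw bytes as a PackedByteArray, so the result still has to be decoded. A minimal sketch, assuming the shader writes 32-bit floats and leaves the reduced value in element 0. Both are assumptions about the shader, which is not shown here, and read_reduction_result is a hypothetical helper name.

```gdscript
# Hypothetical helper. Call it only after rd.submit() and rd.sync().
# Assumption: the output buffer holds 32-bit floats, with the reduced value at index 0.
func read_reduction_result(rd: RenderingDevice, output_buffer: RID) -> float:
    var output_bytes: PackedByteArray = rd.buffer_get_data(output_buffer)
    var output_floats: PackedFloat32Array = output_bytes.to_float32_array()
    if output_floats.is_empty():
        push_error("Output buffer is smaller than one float")
        return NAN
    return output_floats[0]
```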
The Cost of Sync
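sync() stalls the CPU until the GPU is done. For per-frame compute, this serializes CPU and GPU work: each frame costs CPU time plus GPU time instead of the longer of the two. Mitigations:
- Dispatch this frame, sync + read next frame (1-frame latency, usually little or no stall because the GPU has had a full frame to finish).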
- Use a double-buffered result so the CPU reads frame N-1 while the GPU computes frame N.
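The delayed read can be sketched as a _process loop that first collects last frame's result and then queues this frame's work. This is a sketch under stated assumptions, not a definitive implementation: rd is a local RenderingDevice, and pipeline, uniform_set, output_buffer, and groups_x are created elsewhere. The _in_flight flag and latest_result member are hypothetical names.

```gdscript
extends Node

# Assumed to be created during setup (not shown); rd must be a local RenderingDevice.
var rd: RenderingDevice
var pipeline: RID
var uniform_set: RID
var output_buffer: RID
var groups_x: int = 1

var _in_flight := false                  # true while a dispatch is submitted but not yet synced
var latest_result := PackedByteArray()   # result of the previous frame's dispatch

func _process(_delta: float) -> void:
    if rd == null:
        return

    # 1. Collect the result of the dispatch submitted last frame.
    if _in_flight:
        rd.sync()  # still blocks if the GPU is not done, but it has had a full frame
        latest_result = rd.buffer_get_data(output_buffer)
        _in_flight = false

    # 2. Queue this frame's work and return without waiting for it.
    var compute_list := rd.compute_list_begin()
    rd.compute_list_bind_compute_pipeline(compute_list, pipeline)
    rd.compute_list_bind_uniform_set(compute_list, uniform_set, 0)
    rd.compute_list_dispatch(compute_list, groups_x, 1, 1)
    rd.compute_list_end()
    rd.submit()
    _in_flight = true
```

On the first frame latest_result is still empty, so consumers need to handle the case where no result has arrived yet.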
Local vs Global RenderingDevice
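Use a local RenderingDevice (RenderingServer.create_local_rendering_device()) for standalone compute. submit() and sync() are only available on local devices; the main RenderingDevice is busy rendering and is submitted by the engine itself. A local device isolates your compute submit/sync from that work.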
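A minimal setup sketch for a local device, following the pattern in Godot's compute shader tutorial. The shader path res://reduce.glsl, the input values, and the single storage buffer at binding 0 of set 0 are assumptions for illustration; match them to your own shader's layout.

```gdscript
extends Node

var rd: RenderingDevice
var pipeline: RID
var uniform_set: RID
var output_buffer: RID

func _ready() -> void:
    rd = RenderingServer.create_local_rendering_device()
    if rd == null:
        # Happens with the OpenGL (Compatibility) backend or in headless mode.
        push_error("No RenderingDevice available for compute")
        return

    # Hypothetical shader path; the file must be imported as an RDShaderFile.
    var shader_file := load("res://reduce.glsl") as RDShaderFile
    var spirv: RDShaderSPIRV = shader_file.get_spirv()
    var shader: RID = rd.shader_create_from_spirv(spirv)
    pipeline = rd.compute_pipeline_create(shader)

    # Assumption: one storage buffer, used for input and output, at set 0, binding 0.
    var input_bytes := PackedFloat32Array([1.0, 2.0, 3.0, 4.0]).to_byte_array()
    output_buffer = rd.storage_buffer_create(input_bytes.size(), input_bytes)

    var uniform := RDUniform.new()
    uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
    uniform.binding = 0
    uniform.add_id(output_buffer)
    uniform_set = rd.uniform_set_create([uniform], shader, 0)
```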
Verifying
Check that the compute result matches the same reduction computed on the CPU. With the 1-frame-delayed read pattern, sync() should return almost immediately and the frame rate should hold.
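One way to check the result is to compare it against a CPU reference. The sketch below assumes the reduction is a sum of 32-bit floats; verify_reduction is a hypothetical helper, and the tolerance is an illustrative choice rather than a recommended value.

```gdscript
# Hypothetical helper: compares a GPU reduction result against a CPU sum.
# Assumption: the shader computes the sum of all input elements.
func verify_reduction(input: PackedFloat32Array, gpu_result: float) -> bool:
    var expected := 0.0
    for value in input:
        expected += value
    # A parallel reduction adds in a different order than this loop, so
    # float32 rounding differs slightly; compare with a relative tolerance.
    var tolerance := 1e-4 * maxf(1.0, absf(expected))
    return absf(gpu_result - expected) <= tolerance
```

If this check fails and gpu_result equals one of the input values, the read most likely happened before sync().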
“The GPU is async. submit + sync makes the read valid — but design around the stall.”
For physics or simulation compute, the 1-frame-latency double-buffer pattern is a common choice: players rarely notice one frame of latency, and the CPU rarely has to wait on the GPU.