Quick answer: Divergent if-branches (conditions based on UVs, texture samples, or world position) force the GPU to execute both sides of the branch for every thread group whose threads disagree, which can as much as double the work for that group. Replace small divergent branches with mix(), step(), or smoothstep(), keep uniform-based branches for large cost differences, and profile the result on-device.

Your water shader looks great on desktop and Android flagships, but drops to 35 fps on a three-year-old Moto phone. You open the shader and see a dozen if-statements that handle foam, caustics, and edge fading. The culprit is almost certainly branch divergence — a property of how mobile GPUs execute shader code in lockstep. Understanding it lets you decide which branches to keep and which to rewrite as branch-free math.

The Symptom

Your Godot shader runs fast on desktop and on top-tier phones but crumbles on mid-range Android. The performance gap is larger than your vertex count or texture bandwidth would suggest. Commenting out if-blocks restores frame rate, even when those blocks guard expensive work that the condition should have skipped. RenderDoc on desktop shows the shader finishing in microseconds; Arm Performance Studio on mobile shows it dominating the frame.

What Causes This

1. SIMT execution model. GPUs run fragment shaders in fixed-size groups of threads: a “warp” of 32 on NVIDIA, a “wavefront” of 32 or 64 on AMD, a “subgroup” in Vulkan terms. Mobile GPUs use their own group widths, which are often narrower on Mali. All threads in a group share a program counter, so they execute the same instruction at the same time. A branch whose result differs across threads forces the group to execute both branches, masking off the results of the inactive threads.

2. Divergent conditions. A branch is divergent when the condition evaluates differently across neighboring fragments. if (uv.x > 0.5) is divergent: fragments on the left of the mesh take one path, fragments on the right take the other, and a group that spans the boundary runs both. if (roughness < 0.1) where roughness is a uniform is not divergent — every thread agrees.

3. Mobile hardware is narrower and slower. Desktop GPUs hide much of the cost of branch divergence behind enormous parallelism. Mobile GPUs (Mali, Adreno, Apple) have fewer execution units and depend more on shader simplicity. Divergence that costs a few percent on a desktop RTX card can come close to doubling shader time on a part like the Mali-G57.

4. Compiler does not always emit true branches. Modern compilers often predicate short branches (“execute both, select result”), which is sometimes what mix() would do but with extra instructions. Long branches or branches with texture fetches remain real branches.

5. Texture sampling inside branches defeats prefetch. Shader compilers and mobile GPUs try to issue texture fetches well ahead of the instructions that use the result. Samples inside divergent branches cannot be issued early and incur the full memory latency.

The Fix

Step 1: Classify your branches. Before rewriting anything, label each branch as divergent (per-pixel condition) or uniform (condition same for all pixels).

// Example Godot fragment shader snippet
// caustics_pattern() is assumed to be defined elsewhere in this shader.
shader_type spatial;
render_mode unshaded;

uniform float foam_threshold : hint_range(0, 1) = 0.6;
uniform bool  enable_caustics = true;        // uniform: cheap branch
uniform sampler2D flow_map;

void fragment() {
    vec4 flow = texture(flow_map, UV);

    // DIVERGENT: condition depends on per-pixel sample
    if (flow.r > foam_threshold) {
        ALBEDO = vec3(1.0);                   // foam
    } else {
        ALBEDO = vec3(0.1, 0.3, 0.6);          // water
    }

    // UNIFORM: condition is a bool uniform, same for all pixels
    if (enable_caustics) {
        ALBEDO += caustics_pattern(UV) * 0.3;
    }
}

Step 2: Replace divergent branches with branch-free math.

void fragment() {
    vec4 flow = texture(flow_map, UV);

    // Branch-free: evaluate both colors and blend
    vec3 foam_color  = vec3(1.0);
    vec3 water_color = vec3(0.1, 0.3, 0.6);

    // smoothstep() ramps from 0 to 1 across a narrow band, with no divergence cost;
    // use step(foam_threshold, flow.r) instead for a hard 0-or-1 edge
    float mask = smoothstep(foam_threshold, foam_threshold + 0.05, flow.r);
    ALBEDO = mix(water_color, foam_color, mask);

    // Keep the uniform branch — still cheap
    if (enable_caustics) {
        ALBEDO += caustics_pattern(UV) * 0.3;
    }
}
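
If the soft edge from smoothstep() is not wanted, a hard cut that behaves like the original if/else can be sketched with step(). This is a minimal sketch reusing the uniform names from the snippets above; the only behavioral difference from the branch is at exact equality, where step() returns 1.0 (foam) while flow.r > foam_threshold would have chosen water.

```glsl
shader_type spatial;
render_mode unshaded;

uniform float foam_threshold : hint_range(0.0, 1.0) = 0.6;
uniform sampler2D flow_map;

void fragment() {
    float flow_r = texture(flow_map, UV).r;

    // step(edge, x) is 0.0 when x < edge and 1.0 otherwise
    float mask = step(foam_threshold, flow_r);

    // mask == 0.0 selects water, mask == 1.0 selects foam
    ALBEDO = mix(vec3(0.1, 0.3, 0.6), vec3(1.0), mask);
}
```

A hard step() edge can alias along the foam boundary, which is one reason the smoothstep() version above may look better even though both are branch-free.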

Step 3: For large cost differences, keep the uniform branch. When the two sides of a branch differ in cost by 10x or more, a uniform branch beats blending because every thread takes the same side and the expensive side is skipped entirely.

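// volumetric_fog(), cheap_fog(), and WORLD_POSITION are not Godot built-ins; they are
// assumed to be defined elsewhere in this shader (WORLD_POSITION as a varying, for example).
uniform bool high_quality = true;

void fragment() {
    vec3 color = vec3(0.0);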
    if (high_quality) {
        color = volumetric_fog(VIEW, WORLD_POSITION);  // 40-sample raymarch
    } else {
        color = cheap_fog(VIEW, WORLD_POSITION);       // 1 lookup
    }
    ALBEDO = color;
}

// Do NOT turn this into mix(cheap, volumetric, high_quality ? 1.0 : 0.0)
// That would always pay the volumetric cost.

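Step 4: Hoist texture samples out of branches. Even in uniform branches, moving the sample above the branch lets the fetch be issued early. The trade-off is that the sample is paid for even when the feature is off, so reserve this for cheap fetches and keep expensive work behind the uniform branch as in Step 3.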

// Worse: texture sample inside the branch
void fragment() {
    if (enable_caustics) {
        vec4 c = texture(caustics_tex, UV * 4.0);
        ALBEDO += c.rgb;
    }
}

// Better: sample always, use the value conditionally
void fragment() {
    vec4 c = texture(caustics_tex, UV * 4.0);
    float enable = float(enable_caustics);
    ALBEDO += c.rgb * enable;
}
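
The same hoisting idea can be combined with a per-pixel mask when the condition is divergent rather than uniform. The sketch below is an illustration under assumptions: it reuses the flow_map, foam_threshold, and caustics_tex names from the snippets above and invents the rule "caustics only where there is no foam" purely as an example.

```glsl
shader_type spatial;
render_mode unshaded;

uniform float foam_threshold : hint_range(0.0, 1.0) = 0.6;
uniform sampler2D flow_map;
uniform sampler2D caustics_tex;   // assumed to be a plain 2D caustics texture

void fragment() {
    float flow_r = texture(flow_map, UV).r;

    // Sample unconditionally, before any per-pixel decision is made
    vec3 caustics = texture(caustics_tex, UV * 4.0).rgb;

    // 1.0 where the pixel is open water, 0.0 where foam covers it
    float no_foam = 1.0 - step(foam_threshold, flow_r);

    // The mask replaces "if (flow_r < foam_threshold) { ALBEDO += caustics; }"
    ALBEDO = vec3(0.1, 0.3, 0.6) + caustics * no_foam;
}
```

Every pixel now pays for one extra fetch, so this is only likely to win when the fetch is cheap compared with the cost of divergence; measuring on the target device is the way to confirm.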

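Step 5: Profile on device. GPU behavior varies between Mali, Adreno, and Apple’s unified architecture. What helps on one may hurt on another. In Godot 4, enable per-viewport render-time measurement through RenderingServer and read the GPU and CPU times (in milliseconds) each frame. GPU timings may not be reported on every device, so cross-check with a vendor profiler.

func _ready() -> void:
    RenderingServer.viewport_set_measure_render_time(get_viewport().get_viewport_rid(), true)

func _process(_delta: float) -> void:
    var vp := get_viewport().get_viewport_rid()
    var gpu_ms := RenderingServer.viewport_get_measured_render_time_gpu(vp)
    var cpu_ms := RenderingServer.viewport_get_measured_render_time_cpu(vp)
    $Label.text = "GPU %.1fms | CPU %.1fms" % [gpu_ms, cpu_ms]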

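Step 6: Use shader variants for mobile. When a desktop and mobile shader need very different tradeoffs, author two shaders and swap them based on platform. Godot’s feature tags let you check the platform at runtime, for example with OS.has_feature("mobile"), and recent Godot 4 releases also let you query the active rendering method through RenderingServer.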
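
One possible way to author the two variants without duplicating code is the Godot 4 shader preprocessor (#define, #ifdef, #include). The sketch below shows three files in a single listing. The file names, the res://shaders/ path, the WATER_MOBILE macro, and the water_albedo() helper are all hypothetical, and the script that assigns one shader or the other to the material is not shown.

```glsl
// --- res://shaders/water_common.gdshaderinc (hypothetical shared include) ---
uniform float foam_threshold : hint_range(0.0, 1.0) = 0.6;
uniform sampler2D flow_map;

vec3 water_albedo(vec2 uv) {
    float flow_r = texture(flow_map, uv).r;
#ifdef WATER_MOBILE
    // Mobile variant: hard, branch-free foam edge and nothing else
    float mask = step(foam_threshold, flow_r);
#else
    // Desktop variant: softer edge; heavier effects would be added here
    float mask = smoothstep(foam_threshold, foam_threshold + 0.05, flow_r);
#endif
    return mix(vec3(0.1, 0.3, 0.6), vec3(1.0), mask);
}

// --- res://shaders/water_mobile.gdshader (hypothetical) ---
shader_type spatial;
render_mode unshaded;
#define WATER_MOBILE
#include "res://shaders/water_common.gdshaderinc"

void fragment() {
    ALBEDO = water_albedo(UV);
}

// --- res://shaders/water_desktop.gdshader (hypothetical) ---
shader_type spatial;
render_mode unshaded;
#include "res://shaders/water_common.gdshaderinc"

void fragment() {
    ALBEDO = water_albedo(UV);
}
```

Because the choice is made when each shader is compiled, neither variant carries a runtime branch for the other's code path. The cost is that two shader files have to be kept in sync with the include.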

Why This Works

GPUs achieve their performance by trading flexibility for parallelism: every lane in a SIMT group executes the same instruction every cycle. When a branch diverges, the GPU does the only thing it can, which is to execute both sides with some lanes masked off, and that can as much as double the instructions executed for that group. On mobile, where each lane is already working hard to hit its frame budget, this cost can be the difference between 60 and 30 fps.

mix() and step() translate to one or two ALU instructions each with zero control flow. They always execute both “branches”, but so did the divergent branch; the difference is you skip the overhead of the branch itself and the compiler is free to schedule around the result.
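
To make the arithmetic concrete, the built-ins can be written out by hand following their GLSL definitions. This is an illustration only: the *_ref names are hypothetical, real shaders should call the built-ins, and the ternary in step_ref() is the kind of short select that compilers typically handle without a real branch, though that is not guaranteed.

```glsl
shader_type spatial;
render_mode unshaded;

// Hand-written equivalents of the built-ins, for illustration only
float step_ref(float edge, float x) {
    return x < edge ? 0.0 : 1.0;            // definition of step(edge, x)
}

vec3 mix_ref(vec3 a, vec3 b, float t) {
    return a * (1.0 - t) + b * t;           // definition of mix(a, b, t)
}

float smoothstep_ref(float e0, float e1, float x) {
    float t = clamp((x - e0) / (e1 - e0), 0.0, 1.0);
    return t * t * (3.0 - 2.0 * t);         // definition of smoothstep(e0, e1, x)
}

void fragment() {
    // Blend water to foam across the middle of the mesh, with no control flow
    float mask = smoothstep_ref(0.4, 0.6, UV.x);
    ALBEDO = mix_ref(vec3(0.1, 0.3, 0.6), vec3(1.0), mask);
}
```

Written this way it is easy to see that both colors are always computed and the mask only weights them, which is why the approach suits cheap branches and not expensive ones.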

Uniform branches survive because GPUs detect that every thread will agree and execute only the chosen side. This is why if (uniform_bool) is the one kind of branch you should keep: it genuinely skips work.

Prefetching is the hidden win of hoisting samples. Mobile GPUs schedule texture reads far ahead of the shader’s use to hide memory latency. A sample locked inside a conditional block cannot be scheduled early, because the driver does not know if the branch will be taken.

"Branches are not free on mobile. Every divergent if doubles the work. Rewrite in math when you can."

Related Issues

For general mobile performance tuning, see Fix: Godot Low FPS on Android Mid-Range Phones. If your shaders compile on desktop but fail on mobile, check Fix: Godot Shader Compile Error Mobile Only.

Branch on uniforms, math on varyings. Your Mali GPU will thank you.