Quick answer: Tune batch size so each batch takes 30–100 µs. Avoid false sharing by padding per-thread data or giving each thread its own array. Profile to confirm that all worker threads are saturated.

Here is how to fix a Unity IJobParallelFor job that stops scaling beyond 2–3 cores, even on a 16-core CPU. Batch size and false sharing are the two most common culprits.

The Symptom

A job takes 5 ms on 4 worker threads. Raising JobsUtility.JobWorkerCount to 8 brings it to 4.8 ms, and 16 threads gives 4.7 ms. Returns diminish sharply even though plenty of CPU sits idle.
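Relative speedup and parallel efficiency put a number on the plateau. This calculation uses only the 4- and 16-thread timings quoted above:

```latex
S_{4\to16} = \frac{T_4}{T_{16}} = \frac{5.0\ \text{ms}}{4.7\ \text{ms}} \approx 1.06,
\qquad
E = \frac{S_{4\to16}}{16/4} \approx \frac{1.06}{4} \approx 0.27
```

Four times the workers buy roughly a 6% speedup. Roughly three quarters of the added thread capacity therefore does no useful work.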

What Causes This

Batch size too small. Each batch carries a fixed cost for a worker to claim it and start running it. When a batch holds only a few cheap iterations, that cost dominates, and threads spend more time acquiring work than doing it.
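A rough model shows why tiny batches hurt. Suppose every batch pays a fixed overhead o on top of b elements that each cost t. The overhead's share of the batch is then o / (o + b·t).

The sketch below encodes that model. It also suggests a batch size for a target batch duration and counts how many batches exist. The class and method names are hypothetical, and the per-batch overhead is a figure you would estimate yourself; Unity does not publish one.

```csharp
using System;

// Hypothetical back-of-envelope model of per-batch overhead. All inputs are
// numbers you measure or estimate; Unity does not publish a per-batch cost.
public static class BatchMath
{
    // Share of a batch's time spent on fixed overhead: o / (o + b * t).
    public static double OverheadFraction(double perBatchOverheadUs, double perElementUs, int batchSize)
    {
        double workUs = perElementUs * batchSize;
        return perBatchOverheadUs / (perBatchOverheadUs + workUs);
    }

    // Smallest batch size whose work alone reaches the target duration (e.g. 30-100 us).
    public static int SuggestBatchSize(double perElementUs, double targetBatchUs)
    {
        if (perElementUs <= 0) throw new ArgumentOutOfRangeException(nameof(perElementUs));
        return Math.Max(1, (int)Math.Ceiling(targetBatchUs / perElementUs));
    }

    // Number of batches available to hand out; fewer than the worker count leaves workers idle.
    public static int BatchCount(int length, int batchSize) => (length + batchSize - 1) / batchSize;
}
```

Take a hypothetical 1 µs overhead and 0.25 µs per element. A batch of 4 then spends half its time on overhead. SuggestBatchSize(0.25, 50) returns 200, which pushes the overhead below 2%. BatchCount covers the opposite edge: if it comes out below the worker count, some workers never receive a batch.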

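False sharing. Threads that write to different variables on the same cache line (typically 64 bytes) force that line to bounce between cores. Every write stalls even though no data is actually shared.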
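The effect is easy to reproduce outside Unity. This is a plain .NET sketch using System.Threading.Tasks, not the Unity job system. Every task increments only its own counter. The two variants differ only in layout: adjacent longs versus counters padded to 64 bytes each. The sketch assumes a 64-byte cache line, and the type names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Each counter occupies a full 64 bytes (assumed cache-line size).
[StructLayout(LayoutKind.Sequential, Size = 64)]
public struct PaddedLong { public long Value; }

public static class FalseSharingDemo
{
    // Counters are adjacent longs, so several of them share one cache line.
    public static long[] Adjacent(int threads, int iterations)
    {
        var counters = new long[threads];
        Parallel.For(0, threads, t =>
        {
            for (int i = 0; i < iterations; i++) counters[t]++;
        });
        return counters;
    }

    // Same work, but each counter sits 64 bytes from its neighbour.
    public static PaddedLong[] Padded(int threads, int iterations)
    {
        var counters = new PaddedLong[threads];
        Parallel.For(0, threads, t =>
        {
            for (int i = 0; i < iterations; i++) counters[t].Value++;
        });
        return counters;
    }

    // Wall-clock comparison; absolute numbers depend entirely on the CPU.
    public static (long adjacentMs, long paddedMs) Compare(int threads, int iterations)
    {
        var sw = Stopwatch.StartNew();
        Adjacent(threads, iterations);
        long adjacentMs = sw.ElapsedMilliseconds;
        sw.Restart();
        Padded(threads, iterations);
        return (adjacentMs, sw.ElapsedMilliseconds);
    }
}
```

On a multi-core machine, Compare(8, 50_000_000) typically shows the adjacent variant running noticeably slower, even though both variants do identical work. The exact ratio depends on the CPU, so measure rather than trust a fixed figure.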

Memory bandwidth bound. The job does so little work per byte that a few cores already move as much data as the memory subsystem can supply. Beyond that point, more cores cannot help.
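A back-of-envelope check tells you whether a job has already hit the memory wall. Divide the bytes the job moves by its run time, then compare the result with the machine's sustainable bandwidth (from a STREAM-style benchmark or the platform spec). The helper below is a hypothetical sketch. It counts only the bytes you declare, and write-allocate caches can make real traffic higher.

```csharp
using System;

// Hypothetical helper: effective bandwidth of a job, from numbers you supply.
public static class BandwidthCheck
{
    // bytesReadPerElement / bytesWrittenPerElement: what the job touches per index.
    // jobMs: measured wall time of the whole job.
    // Returns GB/s with 1 GB = 1e9 bytes; write-allocate traffic is not counted.
    public static double EffectiveGBps(long elements, int bytesReadPerElement,
                                       int bytesWrittenPerElement, double jobMs)
    {
        if (jobMs <= 0) throw new ArgumentOutOfRangeException(nameof(jobMs));
        double bytes = (double)elements * (bytesReadPerElement + bytesWrittenPerElement);
        double seconds = jobMs / 1000.0;
        return bytes / seconds / 1e9;
    }
}
```

Consider a hypothetical job that reads and writes a 12-byte float3 for each of 10 million elements in 5 ms. EffectiveGBps(10_000_000, 12, 12, 5.0) returns about 48 GB/s. If that is close to what the memory system sustains, extra worker threads have nothing left to speed up.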

The Fix

Step 1: Tune batch size.

// Test different batch sizes: try 16, 64, 256, 1024
// (the second argument of Schedule is the inner-loop batch count)
JobHandle h = job.Schedule(arr.Length, 64);

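Aim for batches that each take 30–100 µs, and check the actual durations in the Timeline view of the Unity Profiler. Batches that are too large cause the opposite problem. Once the number of batches (arr.Length divided by the batch size) drops below the worker count, some workers get no batch at all.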
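One simple way to run the sweep is to time the same job at several batch sizes from a MonoBehaviour. This is a sketch under stated assumptions:

- ScaleJob and BatchSizeSweep are hypothetical names, and the workload is a placeholder for your own job.
- Wall-clock timing only gives the total per run. Per-batch durations still have to come from the Profiler.
- Editor timings include safety checks, so a player build gives more representative numbers.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Placeholder workload; replace with the job you are tuning.
// CompileSynchronously avoids timing the managed fallback while Burst compiles in the Editor.
[BurstCompile(CompileSynchronously = true)]
public struct ScaleJob : IJobParallelFor
{
    public NativeArray<float> Values;
    public void Execute(int i) => Values[i] = Values[i] * 1.5f + 1f;
}

public class BatchSizeSweep : MonoBehaviour
{
    static readonly int[] Candidates = { 16, 64, 256, 1024, 4096 };
    const int Runs = 20;

    void Start()
    {
        var values = new NativeArray<float>(1_000_000, Allocator.Persistent);
        foreach (int batch in Candidates)
        {
            // Warm-up run so one-time costs are not counted.
            new ScaleJob { Values = values }.Schedule(values.Length, batch).Complete();

            var sw = System.Diagnostics.Stopwatch.StartNew();
            for (int r = 0; r < Runs; r++)
                new ScaleJob { Values = values }.Schedule(values.Length, batch).Complete();
            sw.Stop();

            Debug.Log($"batch {batch}: {sw.Elapsed.TotalMilliseconds / Runs:F3} ms per run");
        }
        values.Dispose();
    }
}
```

Pick the smallest batch size after which the per-run time stops improving. Then confirm in the Timeline that individual batches land in the 30–100 µs range.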

Step 2: Avoid false sharing.

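// BAD: per-thread counters as adjacent ints; 16 of them fit in one 64-byte cache line
// (threadCount: one slot per job thread, e.g. JobsUtility.MaxJobThreadCount)
NativeArray<int> counters = new NativeArray<int>(threadCount, Allocator.TempJob);

// BETTER: pad each counter to a full cache line (assumes 64-byte lines;
// StructLayout needs using System.Runtime.InteropServices)
[StructLayout(LayoutKind.Sequential, Size = 64)]
struct PaddedInt { public int Value; }
NativeArray<PaddedInt> counters = new NativeArray<PaddedInt>(threadCount, Allocator.TempJob);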

Each PaddedInt now takes a full 64 bytes, so no two counters can share a cache line. One thread's writes no longer invalidate another thread's counter.
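Here is the padded struct in use: a sketch of a job that counts positive inputs into one padded slot per worker thread and sums the slots afterwards. Two attributes make it work:

- [NativeSetThreadIndex] tells the job which thread is running it.
- [NativeDisableParallelForRestriction] is required because the job writes at the thread index rather than at i.

CountPositiveJob and CountPositive are hypothetical names. The array is sized with JobsUtility.MaxJobThreadCount. On Unity 2022.2 and later, JobsUtility.ThreadIndexCount may be the better bound, so check against your Unity version.

```csharp
using System.Runtime.InteropServices;
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using Unity.Jobs.LowLevel.Unsafe;

// One counter per 64-byte cache line (assumed line size).
[StructLayout(LayoutKind.Sequential, Size = 64)]
public struct PaddedInt { public int Value; }

[BurstCompile]
public struct CountPositiveJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> Input;

    // Written at [threadIndex], not [i], so the parallel-for index check is disabled.
    [NativeDisableParallelForRestriction] public NativeArray<PaddedInt> PerThread;

    // Filled in by the job system with the index of the thread running this job.
    [NativeSetThreadIndex] int threadIndex;

    public void Execute(int i)
    {
        if (Input[i] > 0f)
        {
            PaddedInt slot = PerThread[threadIndex];
            slot.Value++;
            PerThread[threadIndex] = slot;
        }
    }
}

public static class CountPositive
{
    public static int Run(NativeArray<float> input)
    {
        // NativeArray memory is zeroed by default (NativeArrayOptions.ClearMemory).
        var perThread = new NativeArray<PaddedInt>(JobsUtility.MaxJobThreadCount, Allocator.TempJob);
        new CountPositiveJob { Input = input, PerThread = perThread }
            .Schedule(input.Length, 1024).Complete();

        // Combine the per-thread slots on the main thread.
        int total = 0;
        for (int t = 0; t < perThread.Length; t++) total += perThread[t].Value;
        perThread.Dispose();
        return total;
    }
}
```

Padding trades a few kilobytes of memory for the removal of cache-line ping-pong between workers.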

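Step 3: Profile the job timeline. Open Window → Analysis → Profiler, select the CPU Usage module, and switch the details pane to Timeline. With good parallelism, the job's batches appear stacked across all Worker threads at the same moment. With poor parallelism, one or two threads are busy while the rest sit idle.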

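Step 4: Reduce shared writes. Rather than having every thread update one shared counter or buffer, let each thread append to a container designed for parallel writes, such as NativeQueue<T>.ParallelWriter or NativeStream.Writer. Combine the results after the job completes; this causes far less contention than a shared target.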
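Here is a sketch of the NativeQueue route. Each iteration that finds a negative value enqueues its index through a ParallelWriter, and the main thread drains the queue after the job completes. CollectNegativeIndicesJob and CollectNegative are hypothetical names. NativeQueue and NativeList come from the com.unity.collections package.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
public struct CollectNegativeIndicesJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> Input;

    // Parallel writer: safe to Enqueue from many worker threads at once.
    public NativeQueue<int>.ParallelWriter Output;

    public void Execute(int i)
    {
        if (Input[i] < 0f)
            Output.Enqueue(i);
    }
}

public static class CollectNegative
{
    public static NativeList<int> Run(NativeArray<float> input, Allocator allocator)
    {
        var queue = new NativeQueue<int>(Allocator.TempJob);
        new CollectNegativeIndicesJob { Input = input, Output = queue.AsParallelWriter() }
            .Schedule(input.Length, 1024).Complete();

        // Drain on the main thread; element order depends on thread scheduling.
        var result = new NativeList<int>(queue.Count, allocator);
        while (queue.TryDequeue(out int index))
            result.Add(index);
        queue.Dispose();
        return result;
    }
}
```

The output order is not deterministic, so sort the list afterwards if order matters. The caller owns the returned list and must Dispose it.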

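Step 5: For memory-bound work, cut trips to main memory. Some jobs make several passes over the same data. Split that data into tiles small enough to stay in L2 cache, and run every pass on one tile before moving to the next. Otherwise the whole array streams through memory once per pass.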
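Here is a sketch of the tiling pattern. It assumes a hypothetical per-tile normalization that genuinely needs two passes: find the tile's largest magnitude, then divide by it. Each Execute call handles one whole tile, so the second pass re-reads data the first pass just pulled into cache. TiledNormalizeJob and TiledNormalize are hypothetical names. The 32 K-float tile (128 KB) is an assumed size meant to fit in L2 on typical desktop CPUs; tune it for your target hardware.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical per-tile normalization: pass 1 finds the tile's largest |x|,
// pass 2 divides the tile by it. Both passes run back to back on one tile.
[BurstCompile]
public struct TiledNormalizeJob : IJobParallelFor
{
    // Each Execute(tile) writes a whole range, not just [tile], so the
    // parallel-for range check is disabled; tiles never overlap.
    [NativeDisableParallelForRestriction] public NativeArray<float> Data;
    public int TileSize;

    public void Execute(int tile)
    {
        int start = tile * TileSize;
        int end = start + TileSize;
        if (end > Data.Length) end = Data.Length;   // last tile may be partial

        // Pass 1: largest magnitude in this tile.
        float maxAbs = 0f;
        for (int i = start; i < end; i++)
        {
            float a = Data[i] < 0f ? -Data[i] : Data[i];
            if (a > maxAbs) maxAbs = a;
        }
        if (maxAbs == 0f) return;   // all zeros: nothing to scale

        // Pass 2: scale the tile, re-reading values pass 1 just touched.
        float inv = 1f / maxAbs;
        for (int i = start; i < end; i++)
            Data[i] *= inv;
    }
}

public static class TiledNormalize
{
    // Assumed tile size: 32 K floats = 128 KB.
    const int DefaultTileSize = 32 * 1024;

    public static JobHandle Schedule(NativeArray<float> data, JobHandle dependsOn = default)
    {
        int tileCount = (data.Length + DefaultTileSize - 1) / DefaultTileSize;
        var job = new TiledNormalizeJob { Data = data, TileSize = DefaultTileSize };
        // One tile per batch: each tile is already a large, uniform unit of work.
        return job.Schedule(tileCount, 1, dependsOn);
    }
}
```

Compare this with two separate IJobParallelFor passes. There, the second job starts only after the first has streamed the entire array, and on large inputs the data has long been evicted from cache by then.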

“Right batch size + no false sharing + memory-aware. Then scaling kicks in.”

Related Issues

For Burst compile errors, see Burst Compile. For NativeArray disposal errors, see Handle Dispose.

30–100 µs batches. Padded per-thread state. Profile saturation. Cores actually help.