Quick answer: Build a batch-mode screenshot runner that iterates over every supported language, captures PNGs of every key screen, and compares them against committed baselines with odiff or a similar image-diff tool. Fail CI on unexpected differences above a 1–3% tolerance. This catches overflow, clipping, and RTL mirroring bugs before they ship.

Your English UI is pixel-perfect. You hand the project off for German localization, and when you see the German build for the first time, every button is overflowing, every label is clipped, and the character’s name has crashed into the HP bar. The bug is not in your code; it is in the assumption that a UI that fits one language will fit all of them. Automated screenshot testing per locale catches these problems before players do.

Why This Matters

Length and shape vary wildly across languages. A few approximate ratios (text length relative to English):

A button sized for English “OK” does not fit German “Bestätigen”. A label aligned left breaks in RTL. A text box with a fixed height clips Thai diacritics. None of these are caught by unit tests; all of them are caught the moment a screenshot diff shows the wrong pixels.

Step 1: Script a Screenshot Runner

Build a test script that can run in batch mode (no interactive input) and cycle through screens and languages. The exact API depends on your engine.

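// Unity batch mode screenshot runner
using System.Collections;
using System.IO;
using UnityEngine;
using UnityEngine.SceneManagement;

public class LocalizationScreenshotRunner : MonoBehaviour
{
    private readonly string[] _languages = { "en", "de", "fr", "ru", "ja", "zh-cn", "ar", "th" };
    private readonly string[] _screens = { "main_menu", "settings", "inventory", "pause", "shop", "credits" };

    IEnumerator Start()
    {
        // Keep this (root) object alive across scene loads; otherwise the coroutine dies with the first scene.
        DontDestroyOnLoad(gameObject);
        Directory.CreateDirectory("screenshots");

        foreach (var lang in _languages)
        {
            // LocalizationManager stands in for your project's own localization entry point.
            LocalizationManager.SetLanguage(lang);
            yield return new WaitForSeconds(0.5f);

            foreach (var screen in _screens)
            {
                // Wait for the load to finish, let layout settle, then capture at the end of a frame.
                yield return SceneManager.LoadSceneAsync(screen);
                yield return new WaitForSeconds(1.0f);
                yield return new WaitForEndOfFrame();
                ScreenCapture.CaptureScreenshot($"screenshots/{lang}_{screen}.png");
                yield return new WaitForSeconds(0.5f);
            }
        }
        // Application.Quit is ignored inside the Editor, so exit the Editor explicitly there.
#if UNITY_EDITOR
        UnityEditor.EditorApplication.Exit(0);
#else
        Application.Quit(0);
#endif
    }
}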

Invoke from CI:

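# Do not pass -nographics: it skips graphics device initialization, so there is nothing to capture.
# RunFromCI is not defined in the runner above. This assumes you add a static editor method
# with that name that opens a scene containing the runner and enters Play mode.
Unity -batchmode \
  -projectPath . \
  -executeMethod LocalizationScreenshotRunner.RunFromCI \
  -logFile unity.log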
# After the run, screenshots/ directory contains all the PNGs
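
Before diffing, it can help to confirm that the runner actually wrote every expected file, since a screenshot that was never saved would otherwise show up only as a confusing diff failure later. The following is a minimal sketch, assuming the two lists mirror the _languages and _screens arrays in the runner above and that files follow its {lang}_{screen}.png naming. The function name check_screenshots is hypothetical.

```shell
# Hypothetical helper: verify that every language/screen combination produced a PNG.
# These lists are assumed to mirror _languages and _screens in the runner.
languages="en de fr ru ja zh-cn ar th"
screens="main_menu settings inventory pause shop credits"

check_screenshots() {
  dir=$1
  missing=0
  total=0
  for lang in $languages; do
    for screen in $screens; do
      total=$((total + 1))
      file="$dir/${lang}_${screen}.png"
      # -s is true only for files that exist and are not empty
      if [ ! -s "$file" ]; then
        echo "missing: $file"
        missing=$((missing + 1))
      fi
    done
  done
  echo "$missing of $total expected screenshots missing"
  # Return success only when nothing is missing
  [ "$missing" -eq 0 ]
}
```

Something like `check_screenshots screenshots || exit 1` right after the Unity command would stop the job early when the run was incomplete.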

Step 2: Commit Baselines

Run the script once. Manually review every screenshot — yes, every one. This is the only time you look at each pixel on purpose. Fix anything that is obviously wrong. When every language is acceptable, commit the screenshots directory as screenshots/baseline/.

Baselines are the source of truth. Never re-commit them without a human review.

Step 3: Diff on Every Build

On every CI build, run the runner again and output to screenshots/current/. Diff each file against the baseline:

# Install odiff
npm install -g odiff-bin

# Diff every baseline against its current
mkdir -p screenshots/diff
fail=0
for f in screenshots/baseline/*.png; do
  name=$(basename "$f")
  odiff "$f" "screenshots/current/$name" "screenshots/diff/$name" \
    --threshold 0.01 \
    --antialiasing || fail=1
done
exit $fail

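If any image differs beyond the threshold, odiff exits non-zero and the CI step fails. Have the job upload screenshots/diff/ as artifacts so a human can decide whether the change is intended. Note that odiff's --threshold is a per-pixel color tolerance between 0 and 1, not a percentage of changed pixels.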
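
The diff loop walks the baseline directory, so a screenshot that exists only in screenshots/current/ (a newly added screen or language, for example) is never compared. Here is a small sketch of a guard for that case, assuming the same directory layout. The function name find_unbaselined is hypothetical.

```shell
# Hypothetical helper: list PNGs from the current run that have no committed baseline.
find_unbaselined() {
  baseline_dir=$1
  current_dir=$2
  orphans=0
  for f in "$current_dir"/*.png; do
    # If the glob matched nothing, the literal pattern is left behind; skip it
    [ -e "$f" ] || continue
    name=$(basename "$f")
    if [ ! -e "$baseline_dir/$name" ]; then
      echo "no baseline for: $name"
      orphans=$((orphans + 1))
    fi
  done
  # Return success only when every current screenshot has a baseline
  [ "$orphans" -eq 0 ]
}
```

Calling `find_unbaselined screenshots/baseline screenshots/current || fail=1` before the final `exit $fail` would keep the build red until a reviewed baseline is committed for the new file.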

Step 4: Update Baselines Deliberately

When a UI change is intentional, update the baselines in the same commit:

# After making the UI change and running locally
cp screenshots/current/*.png screenshots/baseline/
git add screenshots/baseline/
git commit -m "Update screenshot baselines for redesigned shop"

A code reviewer looking at the PR should manually inspect the new baselines and confirm they look right. Treat baseline updates as seriously as code review.

Handling Flaky Diffs

Screenshot tests can flake for reasons that are not your fault:

Mitigate with:

The goal is “every diff flagged is worth looking at,” not “zero diffs ever.”

What to Capture

Start small. Main menu, pause menu, settings, one gameplay screen. These are the most linguistically dense and the most visited. Expand over time to:

Eight to twelve screens per language is usually enough to catch 90% of localization UI bugs.

“Localization bugs do not appear until you look. An automated screenshot diff is the cheapest way to look at every language, every build, without asking a human to stare at fifty images.”

Related Resources

For broader visual regression testing, see how to use visual snapshot testing for game UI regressions. For broader localization bugs, see game localization testing common bugs and how to track and fix localization bugs in your game.

Include a pseudo-locale (replace every character with an accented version and pad the text by 30%) in your test matrix. It catches nearly every overflow bug without needing a real translation.
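
As a rough illustration of the pseudo-locale idea, the sketch below transforms a single string. It is deliberately simplified: it accents only ASCII vowels rather than every character, pads with `~` by 30% of the original length (rounded up), and wraps the result in brackets so clipping at either end is easy to spot. The function name pseudo_localize and the choice of padding character are assumptions, not part of any particular localization toolkit.

```shell
# Hypothetical pseudo-localizer: accents ASCII vowels and pads the string by about 30%.
pseudo_localize() {
  text=$1
  # Measure the length before accenting so multi-byte characters do not skew it
  len=${#text}
  # 30% of the original length, rounded up
  pad=$(( (len * 30 + 99) / 100 ))
  accented=$(printf '%s' "$text" | sed \
    -e 's/a/á/g' -e 's/e/é/g' -e 's/i/í/g' -e 's/o/ó/g' -e 's/u/ú/g' \
    -e 's/A/Á/g' -e 's/E/É/g' -e 's/I/Í/g' -e 's/O/Ó/g' -e 's/U/Ú/g')
  # Build a run of pad tildes
  padding=$(printf '%*s' "$pad" '' | tr ' ' '~')
  # Brackets mark the true start and end of the string
  printf '[%s%s]\n' "$accented" "$padding"
}
```

For example, `pseudo_localize "Save"` prints `[Sávé~~]`. Running every source string through a transform like this and loading the result as one more language gives the screenshot runner a worst-case layout to capture.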