Quick answer: Build a labeled library of anonymized real sessions (negative class) and known bot sessions + synthetic inputs (positive class). Run every rule change offline against the library, measure TP/FP rates, and gate releases on the numbers.

You change one threshold in your bot detection rule to catch more cheaters. A week later, innocent players are being flagged. You revert. Your bot detection can’t move forward because every tweak risks banning paying customers. A test harness lets you iterate on rules offline, without putting players at risk.

The Data Set

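Two classes, both anonymized: real player sessions labeled legit (the negative class), and known bot sessions plus synthetic inputs labeled bot (the positive class).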

Aim for at least 100 of each class. Every false positive that gets overturned in production is added to the negative set; every new bot technique caught becomes a positive fixture.
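
One way to store such a library is one JSON object per line, loaded into small immutable records. The sketch below assumes that format; the field names (session_id, events), the salted-hash pseudonymization, and the helper names are illustrative assumptions, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSession:
    session_id: str      # pseudonymous ID, never the raw player ID
    label: str           # "bot" or "legit"
    events: tuple = ()   # hypothetical shape: (timestamp_ms, action) pairs


def anonymize_id(player_id: str, salt: str) -> str:
    """Salted one-way hash so a fixture cannot be trivially tied to a player."""
    digest = hashlib.sha256((salt + player_id).encode("utf-8")).hexdigest()
    return digest[:16]


def load_library(lines):
    """Parse JSON-lines fixtures, rejecting labels the harness does not know."""
    library = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the fixture file
        record = json.loads(line)
        if record["label"] not in ("bot", "legit"):
            raise ValueError(f"unknown label: {record['label']!r}")
        events = tuple(tuple(e) for e in record.get("events", []))
        library.append(
            LabeledSession(record["session_id"], record["label"], events)
        )
    return library


def class_counts(library):
    """Count sessions per class, to check the 100-per-class target."""
    counts = {"bot": 0, "legit": 0}
    for session in library:
        counts[session.label] += 1
    return counts
```

Checking class_counts in CI is a cheap way to notice when one class has fallen behind the other.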

The Harness

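def evaluate_rule(rule, library):
    """Run one rule over every labeled session and tally the confusion matrix."""
    tp = fp = tn = fn = 0
    for session in library:
        flagged = rule.evaluate(session)
        if flagged and session.label == "bot":
            tp += 1  # bot correctly flagged
        elif flagged and session.label == "legit":
            fp += 1  # legitimate player wrongly flagged
        elif not flagged and session.label == "bot":
            fn += 1  # bot missed
        else:
            tn += 1  # everything else counts as a true negative
    # Report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}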

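The harness only requires that a rule expose an evaluate(session) method returning a truthy value when the session should be flagged. As a minimal sketch of that interface, here is a hypothetical threshold rule on an assumed actions_per_second feature; the Session shape, the feature, and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    label: str                 # "bot" or "legit", as the harness expects
    actions_per_second: float  # hypothetical feature, for illustration only


@dataclass(frozen=True)
class ActionRateRule:
    max_actions_per_second: float

    def evaluate(self, session) -> bool:
        """Flag sessions whose input rate exceeds the threshold."""
        return session.actions_per_second > self.max_actions_per_second


library = [
    Session("bot", 42.0),
    Session("bot", 11.0),
    Session("legit", 9.5),
    Session("legit", 12.5),
]

strict = ActionRateRule(max_actions_per_second=15.0)
loose = ActionRateRule(max_actions_per_second=10.0)

# The strict rule flags only the fastest bot. The loose rule also catches the
# slower bot, but at the cost of flagging the 12.5 actions/s legit player.
strict_flags = [strict.evaluate(s) for s in library]
loose_flags = [loose.evaluate(s) for s in library]
```

Either rule object can be passed as the rule argument of evaluate_rule. Comparing the two results is the threshold trade-off from the introduction, measured offline instead of discovered in production.
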
CI Gating

Every PR that touches detection rules triggers the harness, and the CI job reports precision and recall. The baseline is the precision of the rule set that currently ships. A change that drops precision below that baseline fails the build. Recall is a secondary metric: a change that lowers recall but keeps precision at or above baseline is acceptable if it’s part of a deliberate tightening pass.
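
A gate like this can be sketched as a small comparison between two metric dictionaries of the shape evaluate_rule returns. This is one possible reading of the policy above, not a definitive implementation: the tightening_pass flag is a hypothetical way for a PR to declare a deliberate tightening pass, and the function names are assumptions.

```python
def gate(candidate, baseline, tightening_pass=False):
    """Return failure messages; an empty list means the build may pass."""
    failures = []
    # Precision is the hard gate: never ship a rule set that flags a larger
    # share of legitimate players than the one already in production.
    if candidate["precision"] < baseline["precision"]:
        failures.append(
            f"precision {candidate['precision']:.3f} is below "
            f"baseline {baseline['precision']:.3f}"
        )
    # Recall is secondary: a drop is tolerated only when the change is
    # explicitly declared as a tightening pass (assumed opt-in flag).
    if candidate["recall"] < baseline["recall"] and not tightening_pass:
        failures.append(
            f"recall {candidate['recall']:.3f} is below "
            f"baseline {baseline['recall']:.3f} without a tightening pass"
        )
    return failures


def exit_code(candidate, baseline, tightening_pass=False):
    """Print each failure and return a process exit code for the CI job."""
    failures = gate(candidate, baseline, tightening_pass)
    for message in failures:
        print("FAIL:", message)
    return 1 if failures else 0
```

A CI script would compute both metric dictionaries on the same library and finish with sys.exit(exit_code(candidate, baseline)), so a nonzero status fails the build.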

Synthetic Data Generation

Build a generator that produces bot-like inputs against your game’s input protocol:

Each class of bot behavior gets 20–50 synthetic sessions, so rule changes can be tested without waiting for real bots to appear.
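
As a rough sketch of such a generator, the code below emits input timestamps for two bot classes. Both classes (fixed_interval and low_jitter), the millisecond ranges, and the SyntheticSession shape are hypothetical stand-ins chosen for illustration; substitute the behaviors and the input protocol your game actually sees.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticSession:
    label: str            # always "bot" for generated fixtures
    bot_class: str        # which generator produced the session
    timestamps_ms: tuple  # input event times, in milliseconds


def fixed_interval(rng, n_events=200):
    """Hypothetical class: inputs arrive at a perfectly regular interval."""
    interval = rng.randint(20, 120)
    return tuple(i * interval for i in range(n_events))


def low_jitter(rng, n_events=200):
    """Hypothetical class: regular interval plus a tiny random jitter."""
    interval = rng.randint(20, 120)
    t, timestamps = 0, []
    for _ in range(n_events):
        timestamps.append(t)
        t += interval + rng.randint(-2, 2)  # step stays >= 18 ms
    return tuple(timestamps)


GENERATORS = {"fixed_interval": fixed_interval, "low_jitter": low_jitter}


def generate(per_class=20, seed=0):
    """Produce per_class sessions for every bot class, reproducibly."""
    rng = random.Random(seed)  # seeded so fixtures are stable across CI runs
    return [
        SyntheticSession("bot", name, make(rng))
        for name, make in GENERATORS.items()
        for _ in range(per_class)
    ]
```

Seeding the generator keeps the fixtures identical between runs, so a change in harness numbers reflects a rule change and not a different random draw.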

Feedback Loop

Production-flagged sessions that get overturned on appeal are critical: they are false positives your harness didn’t catch. Every overturn adds a session to the legit library, so each one widens the range of legitimate play that the next rule change is tested against.
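
Recording an overturn can be as small as rewriting one fixture entry. The sketch below assumes fixtures are stored one JSON object per line with session_id, label, and events fields; that format, the source field, and the function name are assumptions for illustration.

```python
import json


def record_overturn(library_lines, session_id, events):
    """Return new fixture lines with the overturned session labeled legit.

    Any existing entry for the same session is dropped first, so a session
    never appears in the library twice or under conflicting labels.
    """
    kept = [
        line for line in library_lines
        if json.loads(line)["session_id"] != session_id
    ]
    kept.append(json.dumps(
        {
            "session_id": session_id,
            "label": "legit",
            "events": events,
            "source": "appeal_overturn",  # provenance, useful when auditing
        },
        sort_keys=True,
    ))
    return kept
```

The function returns a new list and leaves its input untouched, which makes the library update easy to review as a plain diff in the same PR as the rule fix.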

“Bot detection is a classifier. Classifiers need test data. Without a harness, you’re tuning thresholds by feel, and the cost of miscalibration is banned customers.”

Related Issues

For broader cheat detection work, see how to debug cheat detection false positives. For detecting modded clients, see how to detect modded clients.

Every appeal overturn is a gift — save it into the library. Your next rule change inherits the lesson.