The Regression Suite Got 22x Faster

May 20, 2026 · 6 min read

Two earlier posts argued for an inversion: “The Harness Is The Product” said the shipped artifact is a snapshot of the harness, not the other way around. “The Harness Is The Spec” said the discipline of building the harness is the actual development work. This post is a worked example from one corner of that harness, the regression test suite, where a real extension landed in an afternoon at compiler-extension prices because the compiler is shaped right.

Full regression suite, cold, eight cores: 10 minutes 54 seconds.

Same suite, warm: 30 seconds.

Same snapshot. Byte-identical. 500 passed, 27 failed, 74 todo, 3 skipped, 11 untested.

That’s the receipt. The interesting part is the ratio behind it: an afternoon of work in the test runner, twenty-five lines of compiler change.

Part I: The cache

A regression test in Koru is a directory. It contains input.kz, the expected output, and a set of markers that tell the runner what’s supposed to happen — MUST_RUN, MUST_COMPILE_KZ, MUST_FAIL, and so on. Running the test means compiling input.kz through Koru’s four-stage pipeline (Koru frontend → Zig → backend binary → Zig → final binary), running it, comparing output to expectation.

That takes several seconds per passing test. With six hundred-something tests and eight cores, that’s the eleven-minute cold run.
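As a rough sanity check on those figures, the arithmetic can be written out. This is only a back-of-envelope sketch: it assumes every entry in the snapshot counts as one test and that all eight cores stay busy for the whole cold run, neither of which is stated above.

```python
# Figures taken from the run summary in this post.
cold_seconds = 10 * 60 + 54          # cold run: 10 min 54 s
warm_seconds = 30                    # warm run
cores = 8
tests = 500 + 27 + 74 + 3 + 11       # passed + failed + todo + skipped + untested

speedup = cold_seconds / warm_seconds
# Assumption: perfect parallelism, so total work = wall time * cores.
core_seconds_per_test = cold_seconds * cores / tests

print(f"tests in snapshot: {tests}")
print(f"speedup: {speedup:.1f}x")
print(f"average cost: {core_seconds_per_test:.1f} core-seconds per test")
```

Under those assumptions the average lands at several core-seconds per test, which is consistent with the per-test cost quoted above, and the ratio of cold to warm rounds to the 22x in the title.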

The cache’s job is to skip tests whose inputs haven’t changed since the last run. Inputs here means: the test’s input.kz, its expected output, its marker files, the compiler source code, every transitively imported module, and any koru.json in the test directory or its parents.
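The "test directory or its parents" rule for koru.json can be sketched as a walk up the directory tree. The function name `config_candidates` and the `root` stopping point are hypothetical; the post does not say where the real runner stops walking, and the real runner is a shell script rather than Python.

```python
from pathlib import Path


def config_candidates(test_dir: Path, root: Path) -> list[Path]:
    """List the koru.json path for test_dir and for each parent up to root.

    Paths are returned whether or not the file exists, so a caller can
    also notice a koru.json that appears later.
    """
    current = test_dir.resolve()
    root = root.resolve()
    candidates = []
    while True:
        candidates.append(current / "koru.json")
        # Stop at the chosen root, or at the filesystem root as a fallback.
        if current == root or current == current.parent:
            break
        current = current.parent
    return candidates
```

Each returned path would then be treated like any other input when the fingerprint is built.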

On each test run, the runner records a fingerprint — a sorted text file of key:mtime lines — at .cache-fingerprint inside the test directory. On the next run, it re-stats each path and compares mtimes. If everything matches, the test is skipped and the previous outcome (pass or fail) is reported as a cache hit.
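A minimal sketch of that record-and-compare cycle, in Python for illustration. Only the file name .cache-fingerprint and the sorted key:mtime layout come from the post; the function names, the use of nanosecond mtimes, and the assumption that every listed path exists are mine.

```python
import os
from pathlib import Path

FINGERPRINT_NAME = ".cache-fingerprint"  # file name taken from the post


def compute_fingerprint(paths):
    """Build sorted 'path:mtime' lines. Assumes every path exists."""
    lines = [f"{p}:{os.stat(p).st_mtime_ns}" for p in paths]
    return "\n".join(sorted(lines)) + "\n"


def record_fingerprint(test_dir, paths):
    """Store the current fingerprint inside the test directory."""
    (Path(test_dir) / FINGERPRINT_NAME).write_text(compute_fingerprint(paths))


def is_cache_hit(test_dir, paths):
    """Re-stat the inputs and compare against the stored fingerprint."""
    stored = Path(test_dir) / FINGERPRINT_NAME
    return stored.exists() and stored.read_text() == compute_fingerprint(paths)
```

Sorting the lines makes the comparison independent of the order in which inputs were discovered, so a plain text comparison is enough.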

The warm suite output looks like this:

💾 CACHED ✓ 010_001_hello_world
💾 CACHED ✓ 010_002_simple_event
...
💾 CACHED ✗ 110_013_module_event_globbing (output)
...
RESULTS: 500 passed, 27 failed

Five hundred cached-passes, twenty-seven cached-fails, thirty seconds.

Part II: Why the compiler change was 25 lines

The hard part of any incremental build system is transitive dependency tracking. If io.kz imports fmt.kz, and fmt.kz changes, every test that imports io.kz must invalidate.

Koru’s compiler already walks transitive imports — it has to, in order to compile anything. There’s a work queue in koruc that dequeues each import, parses it, scans for more imports, queues those. It accumulates a hashmap of canonical paths called imported_paths.
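The shape of that work queue can be sketched in a few lines. This is not koruc's code: the `direct_imports` dictionary is a hypothetical stand-in for "parse the file and scan it for imports", and the file names are illustrative.

```python
from collections import deque


def transitive_imports(entry, direct_imports):
    """Collect every path reachable from entry through imports.

    direct_imports maps a path to the list of paths it imports directly.
    """
    imported_paths = set()
    queue = deque(direct_imports.get(entry, []))
    while queue:
        path = queue.popleft()
        if path in imported_paths:
            continue  # already handled; this also stops import cycles
        imported_paths.add(path)
        queue.extend(direct_imports.get(path, []))
    return imported_paths


# Hypothetical graph: input.kz imports io.kz, which imports fmt.kz,
# and fmt.kz imports io.kz back (a cycle).
graph = {
    "input.kz": ["io.kz"],
    "io.kz": ["fmt.kz"],
    "fmt.kz": ["io.kz"],
}
print(sorted(transitive_imports("input.kz", graph)))
```

The visited set is the part that matters for caching: by the time the queue is empty, it holds every file whose change could affect the test.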

The cache needs that hashmap.

So we added a flag:

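// --list-imports: emit transitive resolved canonical paths as a JSON array, then exit.
// Iteration order is whatever the hashmap yields, so consumers should sort if they need stability.
if (list_imports_mode) {
    try printStdout(allocator, "[", .{});
    var first_import = true; // controls comma placement between elements
    var path_iter = imported_paths.keyIterator();
    while (path_iter.next()) |path_ptr| {
        if (!first_import) try printStdout(allocator, ",", .{});
        first_import = false;
        // Paths are written verbatim, without JSON escaping of quotes or backslashes.
        try printStdout(allocator, "\"{s}\"", .{path_ptr.*});
    }
    try printStdout(allocator, "]\n", .{});
    return;
}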

Twenty-five lines including the flag declaration, the help text update, and the dump.

$ koruc --list-imports input.kz
["/.../koru_std/io.kz","/.../koru_std/compiler.kz","/.../base.kz","/.../helper.kz"]

The cache stats each path and stores the mtimes. When any of them changes, the fingerprint mismatches, the cache invalidates, the test re-runs.
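On the consuming side, the step from JSON array to fingerprint lines might look like the sketch below. The function name is hypothetical, the code assumes every listed path still exists, and the real runner does this in shell rather than Python.

```python
import json
import os


def import_fingerprint_lines(list_imports_output):
    """Turn the JSON array printed by --list-imports into 'path:mtime' lines."""
    paths = json.loads(list_imports_output)
    # Sort so the result does not depend on the order the compiler emitted.
    return sorted(f"{path}:{os.stat(path).st_mtime_ns}" for path in paths)


# In a real runner the text would come from the compiler, for example:
#   subprocess.run(["koruc", "--list-imports", "input.kz"],
#                  capture_output=True, text=True).stdout
```

Sorting on the consumer side means the cache does not care what order the compiler happens to emit paths in.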

Part III: What this would cost in another compiler

In most language implementations, getting at the transitive import set from outside the compiler is not a twenty-five-line change. You have three options:

Reimplement the import resolver in your harness, in some scripting language, hoping it matches the compiler’s semantics. (Brittle. Drifts.)
Fork the compiler to add the flag — now you’re vendoring it.
File an RFC and wait.

The reason it’s cheap in Koru is structural. The compiler is a Zig program with a clean entry point. Adding a flag means editing one file, recompiling, done. No compiler team has to bless the change. No plugin architecture has to be threaded through. The compiler is a tool we own end-to-end, and tools we own end-to-end can grow whatever surfaces our workflow needs.

This is the “you are the compiler” tenet at the build-system layer. The downstream effect generalizes well beyond the regression suite: tooling that needs compiler internals lands at compiler-extension prices. Things that would be ergonomically impossible in a sealed-binary compiler — “give me the transitive deps as JSON,” “run only the parse stage,” “emit the AST as data” — are afternoon work.

“The Harness Is The Product” asked: what lives in the harness? Observability, invariants, inference, replay. Underneath all of those sits a category that gets less attention: the surfaces the compiler exposes for harness tools to consume. Those surfaces aren’t free in most language ecosystems. They’re roughly free here. The regression cache is one small worked example of what that buys.

Part IV: Why the cache trusts itself

The first version of the cache hedged. It required the previous test outcome to be a pass (the SUCCESS marker on disk) before reporting a hit.

The incoherence surfaced almost immediately: if the fingerprint matches, the inputs are unchanged (matching mtimes are the cache’s stand-in for byte-identical content), so a deterministic test produces the same outcome. The SUCCESS check is either redundant or it’s compensating for the fingerprint missing some input. Pick one.

The fix is to audit the input set until it’s complete, then trust the cache. We added the MUST_* markers to the input set. We made every fixed-name input emit a line unconditionally (with mtime=0 when absent) so that adding a marker invalidates the cache, not just modifying an existing one. We confirmed that --verify-cache, a mode that runs each cache hit uncached and asserts the outcome matches, caught no violations.
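The effect of emitting a line for absent files is easiest to see in a small sketch. The list of names below is only the subset mentioned in this post, and the function name is hypothetical; the real runner's list and implementation may differ.

```python
import os

# Subset of fixed-name inputs named in the post; the real list may be longer.
FIXED_NAMES = ["MUST_RUN", "MUST_COMPILE_KZ", "MUST_FAIL", "koru.json"]


def fixed_name_lines(test_dir):
    """Emit one 'name:mtime' line per fixed name, using 0 when the file is absent."""
    lines = []
    for name in FIXED_NAMES:
        try:
            mtime = os.stat(os.path.join(test_dir, name)).st_mtime_ns
        except FileNotFoundError:
            mtime = 0  # absent files still get a line
        lines.append(f"{name}:{mtime}")
    return sorted(lines)
```

Because an absent marker contributes a line ending in :0, creating that marker later changes the line and the fingerprint no longer matches. If absent files were simply skipped, the stored fingerprint would have nothing to compare the new file against unless the runner also compared the set of keys.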

The general principle: a cache that hedges against itself is hiding an incomplete input set. Defensive checks layered on top of the cache are evidence that you don’t trust your own enumeration of what makes the outcome.

This is also why the cache caches both passes and failures, by the same argument. If inputs are byte-identical, a failing test will fail the same way. There’s no “fresh feedback” required because the feedback is on disk — compile_kz.err, backend.err, the actual output. They’re left in place when a test fails. The cache just spares you the seconds of re-running the compile.

Part V: Receipts

Twenty-two-x is the warm number. The cold run is still about eleven minutes — that work hasn’t moved. What moved is the cost of re-running after small changes, which is the cost that actually shapes how often you run the suite during development.

Snapshot integrity: cold and warm full-suite runs produce byte-identical snapshots. The --verify-cache mode was tested against an artificially sabotaged cache (broke an input, then hand-aged the fingerprint to claim freshness) and correctly fired 🚨 PARITY VIOLATION. Per-test fingerprints live in each test’s directory as plain text; the format is human-readable and bash can read it without jq.
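The parity check itself reduces to a small loop. The sketch below is an assumption about its shape, not the runner's code: `cached_outcome` and `run_uncached` are hypothetical callables standing in for "read the cached result" and "run the test for real", and the outcomes are plain strings.

```python
def verify_cache(tests, cached_outcome, run_uncached):
    """Re-run every cache hit and collect tests whose fresh outcome differs."""
    violations = []
    for name in tests:
        cached = cached_outcome(name)
        if cached is None:
            continue  # not a cache hit, so there is nothing to verify
        fresh = run_uncached(name)
        if fresh != cached:
            violations.append((name, cached, fresh))
    return violations


# Simulated sabotage: the cache claims a pass, but a real run fails.
cached = {"010_001_hello_world": "pass", "110_013_module_event_globbing": "fail"}
fresh = {"010_001_hello_world": "fail", "110_013_module_event_globbing": "fail"}
for name, was, now in verify_cache(list(cached), cached.get, fresh.get):
    print(f"PARITY VIOLATION {name}: cached={was} fresh={now}")
```

A cached failure that still fails is not a violation, which matches the argument for caching both outcomes.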

The cache is opt-in (--cache). It does not change the cold path or the default behavior of ./run_regression.sh. The default will flip once we trust it more.

The point of the post is not the speedup. The speedup is what made it visible.

The point is that this kind of work — high-leverage tooling extension, real correctness guarantees, an afternoon of effort — happened because the compiler is a program our test runner could ask things of. In another stack, the same change is a vendoring problem or an RFC. Here, it’s an edit and a recompile.

The compiler is a program. The cache is a library. The harness is the product.