The Harness Is The Spec

May 12, 2026 · 12 min read

“The Harness Is The Product” argued the inversion: in AI-assisted development, the shipped artifact is downstream of the harness, not validated by it. That essay treated the harness as a noun — observability, constraints, inference, replay, feedback loops.

This essay is about the verb. The discipline that produces the harness. Why it has to happen in parallel with the feature, not before it. Why it isn’t TDD even though it looks like a cousin. And why, in the AI era, this discipline stops being a niche specialty and becomes the skill that gates whether the rest of the product can exist at all.

The compressed version: ship the harness so the product can be birthed into it. Per feature, in parallel, every time.

Part I: The Discipline, Not Just The Noun

The previous essay made the harness sound like infrastructure. Something you build, something that sits there, something that catches problems. That framing is half right.

The other half is that building the harness is itself a creative, specialized act. The harness doesn’t fall out of the product. It doesn’t get auto-generated from a spec. It is produced by someone who knows what “correct” means for this system — what to capture, what to assert, what to leave to inference, what feedback to surface and where.

That work is harness engineering. It is not a tooling exercise, it is not test writing, and it is not adjacent to development. In the inverted model, it is development.

This work is invisible mostly by historical accident. For 30 years, harnesses lived downstream of products, which meant they were chosen after the product was scoped. The tester glued Playwright onto the surface, glued Jest onto the units, glued mocks onto the network. The tools were off-the-shelf because the product was the dominant cost and the harness was a thin layer over it.

That arrangement hid how hard test design actually is.

The inversion exposes the hardness. When the harness is the primary artifact, the tools become components of a bespoke surface, not the surface itself. Playwright doesn’t die — it becomes one element inside something larger that someone designed deliberately, for this product, knowing what “correct” looks like in this domain.

That knowledge has always lived in good testers. The downstream model wasted it.

Part II: Ship The Harness, Birth The Product

The phrase “harness-first” implies a phase. Build the harness, then build the product. That framing is wrong, and it’s a trap that buys premature scaffolding for features nobody is going to write.

The correct discipline is harness-in-parallel, per feature.

Concretely: for each new capability you want the product to have, you ship the harness extension first — the observability hooks, the constraints, the test cases, the feedback target — and then you let the product feature be birthed into the now-extended harness. The harness extension and the product feature are written in lockstep. Often by the same person. Sometimes by AI working against the harness as a constraint surface.
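A minimal sketch of what "ship the harness extension first" could look like for a single feature. Everything in it is hypothetical: the `HarnessExtension` class, the `transfer` feature, and the event shapes are illustrations, not taken from any real harness. The point is only the ordering: the observability hook and the constraint exist before the feature does, and the feature reports into them.

```python
from dataclasses import dataclass, field
from typing import Callable

Event = dict  # one observed fact about a run, e.g. {"type": "debit", "amount": 5}


@dataclass
class HarnessExtension:
    """Harness pieces shipped for one feature before the feature exists (hypothetical)."""
    feature: str
    # Constraints: named predicates over the events a run emitted.
    constraints: dict[str, Callable[[list[Event]], bool]] = field(default_factory=dict)
    # Observability target: the feature reports what it did here.
    events: list[Event] = field(default_factory=list)

    def observe(self, event: Event) -> None:
        self.events.append(event)

    def violations(self) -> list[str]:
        """Names of the constraints that the observed run broke."""
        return [name for name, holds in self.constraints.items() if not holds(self.events)]


def money_is_conserved(events: list[Event]) -> bool:
    debits = sum(e["amount"] for e in events if e["type"] == "debit")
    credits = sum(e["amount"] for e in events if e["type"] == "credit")
    return debits == credits


# Step 1: ship the harness extension.
harness = HarnessExtension(
    feature="transfer",
    constraints={"money_is_conserved": money_is_conserved},
)


# Step 2: the feature is written against it, by a person or by an AI.
def transfer(amount: int, observe: Callable[[Event], None]) -> None:
    observe({"type": "debit", "amount": amount})
    observe({"type": "credit", "amount": amount})


transfer(5, harness.observe)
print(harness.violations())  # an empty list means the run stayed inside the constraints
```

An implementation that debits without crediting would put `money_is_conserved` in the violations list, whoever or whatever wrote it.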

This is not the same as TDD, though the surface resemblance is what causes most of the confusion.

TDD is a workflow within “test then code.” It assumes a mental model of the requirement exists in someone’s head, and the test is the act of encoding that model into an executable check. The test follows the requirement. The code follows the test. The mental model is upstream of all of it.

Harness engineering rejects that premise. The requirement does not exist anywhere except in the harness. The act of designing the harness is the act of doing requirements engineering. There is no separate model that the harness is encoding — the harness is the model, made executable, observable, and inferable.

That’s a category difference, not a scope difference. TDD is “write the test for this one function before the function.” Harness engineering is “design the constraint surface that will make this feature possible to define at all.”

Part III: The Harness Is The Spec

The cleanest version of the claim:

The harness is the spec. No other spec exists.

This is not an oversight. It is the architectural decision. A separate design document would be a second source of truth, which means it would drift the instant it is written, which means it would lie quietly while the system keeps moving. The only consistent arrangement is the one where the executable artifact and the specification artifact are the same artifact.

We do this in Koru. Koru has no language design document that lives outside its tests. The test suite is the language definition. If let x = 1 is not in the tests, it does not exist in Koru. We do not have a separate “what the language is supposed to do” file that the tests validate against. The tests don’t validate Koru. The tests are Koru.

A programming language is the cleanest possible example of this discipline because a language cannot exist as an approximation. The semantics either are or aren’t. There is no “mostly correct” parsing. There is no “approximately right” type system. Every operator, every keyword, every error message either has an executable definition or it doesn’t. The forcing function is total.

But the same logic applies to any system whose correctness can’t be summarized in a sentence. The further you get from CRUD, the more your “design document” is just a fiction held together by social agreement among the people who happened to be in the meeting. The harness is the only form of correctness that doesn’t drift.

Part IV: The Four-Role Collapse

When the harness is the spec, it stops being one artifact playing one role. It collapses several roles that used to be separate into one queryable surface.

For Koru, the same test suite is simultaneously:

The executable specification. The definition of what the language is.
The teaching material. Every test is a worked example, and the site renders them as browsable lessons with prose, categories, and ordering.
The live audit surface. Each lesson shows status: success | failed | broken | todo so a reader sees whether the language currently does what the page claims it does.
The feedback target. A sidebar on every page lets visitors leave feedback on the exact lesson they are reading, not in a separate issue tracker.
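The four roles above can be sketched as one record type. This is not Koru's implementation: the `Lesson` class and its field names are hypothetical, the checks use Python's own arithmetic as a stand-in for a language under test, and the meanings given to the four status values are an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Lesson:
    """One test case carrying all four roles (hypothetical structure)."""
    lesson_id: str
    category: str
    order: int
    prose: str
    check: Optional[Callable[[], bool]]  # the executable claim; None if not written yet
    feedback: list[str] = field(default_factory=list)  # reader comments on this lesson

    def status(self) -> str:
        # Assumed meanings: todo = no check yet, broken = the check itself
        # crashed, failed = it ran and the claim did not hold.
        if self.check is None:
            return "todo"
        try:
            return "success" if self.check() else "failed"
        except Exception:
            return "broken"

    def render(self) -> str:
        # Teaching material and audit surface come from the same record.
        return f"[{self.status()}] {self.category} #{self.order}: {self.prose}"


lessons = [
    Lesson("add-1", "arithmetic", 1, "1 + 1 evaluates to 2", lambda: 1 + 1 == 2),
    Lesson("mul-1", "arithmetic", 2, "2 * 2 evaluates to 5", lambda: 2 * 2 == 5),
    Lesson("div-1", "arithmetic", 3, "1 / 0 evaluates to 0", lambda: 1 / 0 == 0),
    Lesson("pow-1", "arithmetic", 4, "exponent syntax", None),
]

# Feedback attaches to the exact lesson, not to a separate tracker.
lessons[2].feedback.append("Should this lesson say division by zero is an error?")

for lesson in sorted(lessons, key=lambda item: item.order):
    print(lesson.render())
```

Because the status is computed from the same `check` that defines the behavior, the rendered page cannot claim something the system no longer does.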

Four roles. One artifact. The off-the-shelf-tools approach cannot produce this because none of those tools were ever asked to.

The collapse matters because each of these roles, in the federated arrangement, is a separate place where information drifts. The spec lies. The docs lie differently. The audit dashboard lies in a third way. The feedback collected in Jira lives in a fourth context, severed from the surface it was about. Every additional artifact is another place that has to be kept in sync, and the sync work doesn’t get done because nobody is paid to do it.

One bespoke artifact playing four roles eliminates four sync problems simultaneously.

Part V: Drift Becomes Loud

This is the positive case for the inversion. Up to here, the argument has been defensive — AI velocity forces this, complexity forces this, the alternative is impossible. But the harness-as-spec arrangement is also better on a dimension we’ve been quietly losing on for three decades.

Documentation drift was always invisible. A wrong document still reads correct. It still looks authoritative. The user trusting it does not know they are being misled. The failure mode of stale prose documentation is silent.

Tests-as-spec change the failure mode. A broken test is red. If the documentation is wrong, the documentation fails visibly, on the same surface the user is reading. The fix is forced into the same commit as the behavior change because no commit that breaks a test passes CI. The cost of drift moves from “compounds silently for months” to “stops the merge in five minutes.”
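Python's standard `doctest` module gives a small, concrete picture of documentation that fails loudly. In the sketch below, the `slugify` functions and the `count_doc_failures` helper are hypothetical; only the `doctest` calls are real library API. The second function changes behavior without updating its documented example, and the stale documentation shows up as a failure rather than as a quiet lie.

```python
import doctest


def slugify(title: str) -> str:
    """Turn a title into a URL slug.

    The example below is documentation and test at once:

    >>> slugify("Hello World")
    'hello-world'
    """
    return title.lower().replace(" ", "-")


def slugify_v2(title: str) -> str:
    """Behavior changed to underscores, but the documented example was not updated.

    >>> slugify_v2("Hello World")
    'hello-world'
    """
    return title.lower().replace(" ", "_")


def count_doc_failures(func) -> int:
    """Run the examples in func's docstring and return how many of them failed."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the textual failure report; only the count matters here.
    return runner.run(test, out=lambda text: None).failed


print(count_doc_failures(slugify))     # 0: documentation and behavior agree
print(count_doc_failures(slugify_v2))  # 1: the stale example fails
```

Wired into CI, a nonzero count blocks the merge, which is the mechanism that forces the documentation fix into the same commit as the behavior change.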

This was already worth doing in the pre-AI era. The reason it didn’t get done is that prose documentation is cheap to write and tests-as-spec are expensive to design, so the project economics pushed everyone toward the silent-drift arrangement.

AI changes those economics in two directions at once. Tests-as-spec are now cheaper to design (AI can help). And prose documentation drift is now actively produced at scale, because AI is excellent at generating plausible-looking documentation that does not match the code. Doc drift used to be neglect. It is now production output.

There is no defense against AI-generated drift except an executable counter-surface that audits itself in front of the reader. Tests-as-spec is the only arrangement that survives the AI era with its truth claims intact.

Part VI: The Diagnostic

If the argument so far is right, then most teams attempting AI-coded products without harness-in-parallel discipline are accumulating a particular kind of debt — and they cannot see it.

Here is the line, sharpened for diagnosis:

Compounding costs from not knowing in which direction the AI is coding things.

That is the failure mode. It is not slow shipping. It is not visible bugs. It is directional blindness. The AI is producing output. The dashboard says velocity is up. The PRs are merging. From the outside, the project looks healthy. But somewhere underneath, the AI is making choices about what the system means — what shippable looks like, which invariants to preserve, which corners to round off — and nobody is checking those choices against any executable definition of correct, because no such definition exists outside the AI’s head and the AI does not have a head.

The debt compounds invisibly. Months in, you discover the system does not behave the way anyone thought it did, and there is no spec to compare it against, because the spec was supposed to be the tests, and the tests were never written in parallel.

That is the failure mode harness-in-parallel prevents. Not by being more careful. By making direction a first-class measurable quantity. If the test for “feature X should behave like Y” is in the harness before AI starts work, then any output that violates Y is red at merge time. Direction is bounded. The AI runs fast inside the bounds because the bounds are explicit.
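One way to picture direction as a measurable quantity is a merge gate that checks any candidate implementation against properties written before the work started. This is a sketch under stated assumptions: the discount feature, the property names, and both candidates are invented for illustration, with the candidates standing in for AI output.

```python
from typing import Callable

Discount = Callable[[int, int], int]  # (price_in_cents, percent) -> discounted price

# Written before any implementation exists: what "correct" means for the
# hypothetical feature "apply a percentage discount to a price in cents".
PROPERTIES: dict[str, Callable[[Discount], bool]] = {
    "zero_percent_changes_nothing": lambda f: f(1000, 0) == 1000,
    "full_discount_is_free": lambda f: f(1000, 100) == 0,
    "never_negative": lambda f: all(
        f(p, d) >= 0 for p in (0, 1, 999) for d in (0, 50, 100)
    ),
    "never_raises_price": lambda f: all(
        f(p, d) <= p for p in (0, 1, 999) for d in (0, 50, 100)
    ),
}


def gate(candidate: Discount) -> list[str]:
    """Return the names of violated properties; an empty list means mergeable."""
    return [name for name, holds in PROPERTIES.items() if not holds(candidate)]


def candidate_a(price: int, percent: int) -> int:
    return price - price * percent // 100


def candidate_b(price: int, percent: int) -> int:
    # Plausible-looking, but treats percent as an absolute amount in cents.
    return price - percent


print(gate(candidate_a))  # []
print(gate(candidate_b))  # ['full_discount_is_free', 'never_negative']
```

The second candidate would pass a casual review. The gate names which definitions of correct it drifted from, and it does so at merge time rather than months later.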

Without the bounds, AI velocity is not a feature. It is the failure mode.

Part VII: Past A Complexity Threshold

Honesty matters. Not every AI-coded product needs this discipline.

A simple CRUD app, a one-off script, a glue layer between two well-understood APIs — these can be AI-coded with very little harness and they will turn out fine. The complexity is low enough that “correct” is roughly visible to the human reviewer, and any drift will surface within minutes of running the thing.

The complexity threshold is real, and the argument has to acknowledge it. The claim is not “every line of AI-generated code needs a harness.” The claim is that, past a complexity threshold, harness-in-parallel is the only arrangement that ships.

The threshold is roughly: any product whose correctness cannot be summarized in a paragraph. Any product with state that persists across operations. Any product with non-deterministic outputs (LLM in the loop). Any product where the cost of being subtly wrong exceeds the cost of being obviously broken — which is most production systems, and basically every system in regulated or public-sector contexts.

For those products, harness-in-parallel is the floor, not the ceiling.

Part VIII: The Tester’s Craft, Centered

There is a payoff in this story that testers specifically need to hear, because they will hear it nowhere else and they have been hearing the opposite for years.

The discipline of harness engineering is not an automation discipline. It is not a tooling discipline. It is a design discipline, and the people best equipped to do it are the people who have been quietly developing the relevant intuitions for decades: the testers who learned the hard way what “correct” looks like in a real system, what fails silently, what fails loudly, what edges to push on, and where the model in the user’s head differs from the model in the code.

That craft has been undervalued because it lived downstream. The product was the prestige artifact. The test suite was the cleanup crew.

The inversion moves that craft to the center. Not as promotion. Not as career-coaching reassurance. As architectural fact. The thing the tester knows how to do — design a constraint surface that captures what correct means in this domain — is the thing the rest of the product now depends on existing.

The teams that internalize this will move faster than teams that don’t. Not because their testers got better. Because they noticed that their testers were already the people best positioned to do the highest-leverage work in the AI era, and they reorganized around that fact.

The teams that do not internalize it will keep treating harness work as overhead, will keep grabbing Playwright off the shelf, will keep generating plausible-looking documentation that does not match the code, and will keep wondering why their AI velocity charts look great while their systems quietly stop meaning what anyone thought they meant.

Conclusion

The previous essay said the harness is the product. This one sharpens that: the harness is the spec, the documentation, the audit, the feedback target, and the only durable artifact in the system. The product is what gets birthed into it.

The discipline that produces the harness is its own craft. It is not TDD. It is not Playwright. It is not test automation. It is the act of deciding what correct means for this specific system, encoded as something executable, observable, and inferable — designed in parallel with each feature, never downstream, never as a phase, never as an afterthought.

In the AI era, that discipline is no longer optional. Past a real complexity threshold, products that do not invest in it cannot ship — not because they will ship slower, but because they will ship something nobody can verify is the thing they meant to build.

Ship the harness. Birth the product. Then do it again, next feature, in parallel.

This essay is the second in a thread that started with “The Harness Is The Product” and connects directly to “Tests as Living Documentation”, which describes the Koru implementation of the four-role collapse in concrete terms.