The Harness Is The Product


The traditional mental model:

Product → Tests (validate the product)

The inversion that’s actually happening:

Harness → Product (the product is a projection of the harness)

The harness isn’t a safety net. The harness is the core product. The shipped artifact is a snapshot.

This essay is about why that inversion matters, and what it means for how we build software.


Part I: The Blindness Problem

AI coding assistants are blind to runtime.

They can read your code. They can reason about structure. They can suggest changes, refactor functions, add features. But they cannot see:

  • What’s logging to the console right now
  • What network requests are failing
  • What errors are being thrown
  • What the user is actually clicking on

Debugging becomes a game of copy-paste telephone. “Can you paste the error message?” “What does the console show?” “What’s the network tab say?”

The AI is powerful but blind. You are sighted but slow. The bottleneck isn’t intelligence—it’s information transfer.

The Obvious Solution

Capture everything. Make runtime visible. Give the AI eyes.

This is what observability tools do. Logging. Tracing. Metrics. APM dashboards. The enterprise has been doing this for decades.

But enterprise observability is built for production monitoring. It’s designed to answer: “Is the system healthy? Where are the bottlenecks? What’s the error rate?”

That’s not the question we need to answer.

The question is: “What just happened in my development environment so the AI can help me fix it?”

Different question. Different tool.

The Five-Minute Firehose

Here’s what actually works: capture everything, keep it for five minutes, make it queryable.

No schema. No filtering at capture time. No decisions about what matters. Just swallow the firehose and let the query layer figure it out.

# AI runs this and instantly sees your app's state
# (the URL is quoted so the shell does not treat "?" as a glob character)
curl 'http://localhost:6274/recent?_type=console.error'

The AI can now see what you see. The information transfer bottleneck dissolves. You describe the problem, the AI queries the state, you iterate together.
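To make "capture everything, keep it for five minutes, make it queryable" concrete, here is a minimal in-memory sketch. The names `Firehose`, `DevEvent`, and `RETENTION_MS` are illustrative assumptions, not Sidetrack's actual internals; the only thing borrowed from the example above is the `_type` field used in the query.

```typescript
// Hypothetical event shape: `_type` and `at` are required, everything else is schemaless.
type DevEvent = { _type: string; at: number; [key: string]: unknown };

const RETENTION_MS = 5 * 60 * 1000; // the five-minute window from the text

class Firehose {
  private buffer: DevEvent[] = [];
  private clock: () => number;

  // The clock is injectable so retention can be exercised without waiting.
  constructor(clock: () => number = () => Date.now()) {
    this.clock = clock;
  }

  // Capture without filtering: anything with a type goes in.
  capture(type: string, payload: Record<string, unknown> = {}): void {
    this.evict();
    this.buffer.push({ ...payload, _type: type, at: this.clock() });
  }

  // Filtering happens at query time, e.g. recent("console.error").
  recent(type?: string): DevEvent[] {
    this.evict();
    return this.buffer.filter((e) => type === undefined || e._type === type);
  }

  // Drop everything older than the retention window.
  private evict(): void {
    const cutoff = this.clock() - RETENTION_MS;
    this.buffer = this.buffer.filter((e) => e.at >= cutoff);
  }
}
```

Nothing is decided at capture time; the only policy is the eviction cutoff. A real tool would likely put an HTTP endpoint in front of `recent`, but that part is left out here.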

This is table stakes. This is the minimum viable observability for AI-assisted development.

But it’s not the interesting part.


Part II: The Playwright Problem

Once you have observability, you start thinking about testing.

The standard answer is Playwright. Browser automation. Click buttons, fill forms, assert on DOM state.

await page.click('[data-testid="submit-button"]')
await expect(page.locator('.success-message')).toBeVisible()

This works for CRUD apps with simple interactions. Click button, see message. Fill form, submit, check result.

It doesn’t work for:

Complex spatial UIs. Graph editors. Node-based workflows. Draggable windows. Floating panels. The DOM structure is dynamic. Coordinates aren’t stable. “Click the third node” doesn’t mean anything when nodes can be anywhere.

Non-deterministic outputs. LLM in the loop. The response varies every time. You can’t assert “this exact text appears” because the text is different on every run.

Behavior over appearance. You don’t actually care if the success message has the right CSS class. You care that the order was created in the backend. Playwright tests the surface, not the behavior.

Most interesting software being built today doesn’t fit the Playwright model. And the more AI-assisted development accelerates, the more we build things that don’t fit.

What Can You Actually Assert?

When outputs are non-deterministic and UIs are complex, boolean assertions break down.

assert(result === expected) — but what’s “expected” when the output is probabilistic?

The answer: you can’t assert on values. You assert on events.

  • “An LLM response event arrived”
  • “The response was rendered to the canvas”
  • “No error events fired”
  • “The state update contained the required fields”

You’re not testing “did this exact thing happen.” You’re testing “did the system behave correctly at the seam level.”

The seams are stable even when the outputs aren’t. The event flow is deterministic even when the content isn’t.
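The four checks listed above could look roughly like this. It is a sketch under assumptions: the event type names (`llm.response`, `canvas.render`, `state.update`) and the required fields (`id`, `content`) are invented for illustration.

```typescript
type DevEvent = { _type: string; [key: string]: unknown };

// Returns a list of failures; an empty list means the seams held.
function checkSeams(stream: DevEvent[]): string[] {
  const failures: string[] = [];
  const types = stream.map((e) => e._type);

  const responseAt = types.indexOf("llm.response");
  const renderAt = types.indexOf("canvas.render");

  if (responseAt === -1) failures.push("no LLM response event arrived");
  if (renderAt === -1) failures.push("response was never rendered to the canvas");
  if (responseAt !== -1 && renderAt !== -1 && renderAt < responseAt) {
    failures.push("render happened before the response arrived");
  }
  if (types.some((t) => t.endsWith(".error"))) failures.push("an error event fired");

  // Required fields on the state update, whatever their values are.
  const update = stream.find((e) => e._type === "state.update");
  for (const field of ["id", "content"]) {
    if (update === undefined || !(field in update)) {
      failures.push(`state update missing "${field}"`);
    }
  }
  return failures;
}
```

Note that nothing here inspects the text of the response. Two runs with completely different LLM output pass or fail for the same structural reasons.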


Part III: From Boolean to Inference

Here’s the shift that matters:

Traditional testing: assert(x === y) — true or false, pass or fail

What we actually need: “Given everything that happened, does this look right?”

The first is a boolean. The second is inference.

When you have an LLM in the loop, you can’t write:

expect(response.text).toBe("Hello, world!")

But you CAN write:

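// toSatisfy is not a core Jest matcher; jest-extended provides one, for example.
// hasNoErrors, containsGreeting and responseTimeUnder are illustrative helpers.
expect(response).toSatisfy(r =>
  hasNoErrors(r) &&
  containsGreeting(r) &&
  responseTimeUnder(r, 2000)
)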

Or better—you feed the entire event stream to another model and ask: “Does this run look correct given the intent?”

The “test” isn’t code anymore. The test is judgment. And judgment can be delegated to inference.
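Delegating judgment still needs a little plumbing around the model. A minimal sketch, assuming a hypothetical `Model` function that takes a prompt and resolves to the raw reply text; no particular LLM API is implied, and the prompt wording is only an example.

```typescript
type DevEvent = { _type: string; [key: string]: unknown };
type Verdict = { ok: boolean; reason: string };
// Hypothetical model client: takes a prompt, resolves to the raw reply text.
type Model = (prompt: string) => Promise<string>;

function buildJudgePrompt(intent: string, stream: DevEvent[]): string {
  return [
    `Intent: ${intent}`,
    "Events (oldest first, one JSON object per line):",
    ...stream.map((e) => JSON.stringify(e)),
    'Reply with JSON only: {"ok": boolean, "reason": string}',
  ].join("\n");
}

// Anything that is not a well-formed verdict counts as a failure, never a pass.
function parseVerdict(reply: string): Verdict {
  try {
    const parsed: unknown = JSON.parse(reply);
    if (typeof parsed === "object" && parsed !== null) {
      const { ok, reason } = parsed as { ok?: unknown; reason?: unknown };
      if (typeof ok === "boolean" && typeof reason === "string") return { ok, reason };
    }
  } catch {
    // fall through to the failing verdict below
  }
  return { ok: false, reason: "judge reply was not a valid verdict" };
}

async function judgeRun(model: Model, intent: string, stream: DevEvent[]): Promise<Verdict> {
  return parseVerdict(await model(buildJudgePrompt(intent, stream)));
}
```

The design choice worth keeping even if everything else changes: the judge fails closed. A rambling or malformed reply is treated as "not ok", so the soft check can never silently wave a run through.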

The Observability-Testing Continuum

This reveals something interesting. Observability and testing aren’t separate activities. They’re points on a continuum:

Pure Observation — “What happened?” Passive, after-the-fact. Human judges correctness.

Guided Observation — “What happened, filtered by what I care about?” Still passive, but focused.

Replay — “What happened, and can we do it again?” Capture inputs, replay deterministically.

Constraint Checking — “Did anything violate known invariants?” Automated but still pattern-matching.

Inference — “Does this run smell correct given the intent?” Judgment, not assertion.

Traditional testing lives at “Constraint Checking.” Playwright lives there. Unit tests live there.

But AI-assisted development needs to operate further along the continuum. When outputs are probabilistic, when UIs are spatial, when behavior matters more than appearance, you need inference, not assertion.

And inference needs data. The more you observe, the better your inference. The firehose feeds the judgment.
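The Replay point on the continuum is the easiest one to show in code. A sketch, assuming the app's state changes can be expressed as a pure step function over recorded inputs; the tiny graph editor and the `EditorInput` shapes are invented for illustration.

```typescript
// Hypothetical recorded inputs for a tiny node-graph editor.
type EditorInput =
  | { kind: "addNode"; id: string }
  | { kind: "moveNode"; id: string; x: number; y: number }
  | { kind: "removeNode"; id: string };

type Graph = Record<string, { x: number; y: number }>;

// Pure step function: the same state and input always give the same next state.
function step(state: Graph, input: EditorInput): Graph {
  switch (input.kind) {
    case "addNode":
      return { ...state, [input.id]: { x: 0, y: 0 } };
    case "moveNode":
      return input.id in state ? { ...state, [input.id]: { x: input.x, y: input.y } } : state;
    case "removeNode": {
      const next = { ...state };
      delete next[input.id];
      return next;
    }
  }
}

// Replaying a captured session is just folding the inputs over an empty state.
function replay(inputs: EditorInput[]): Graph {
  return inputs.reduce<Graph>(step, {});
}
```

This only works for the parts of the system that are deterministic given their inputs. An LLM call in the middle would have to be recorded as an input too, rather than re-executed.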


Part IV: The Harness Inversion

Now we can see the inversion clearly.

The old model: You build a product. Then you write tests. The tests validate the product. If tests pass, the product is correct.

The new model: You build a harness. The harness captures everything. The harness applies constraints (hard and soft). The harness feeds inference. The product is whatever survives the harness.

The harness isn’t downstream from the product. The product is downstream from the harness.

This matters because:

1. The harness compounds. Every constraint you add makes future development faster. Every pattern the inference learns makes judgment better. The harness gets smarter over time. The product is rebuilt constantly, but the harness accumulates.

2. The harness is the competitive advantage. Two teams with equal talent—the one with the better harness ships faster, with fewer bugs, with more confidence. The harness is the force multiplier.

3. The harness enables AI velocity. The AI can iterate fearlessly when the harness catches mistakes. Without a harness, every AI suggestion needs human verification. With a harness, the human verifies the constraints, and the AI operates freely within them.

What Lives in the Harness?

  • Observability capture. Everything that happens, queryable.
  • Invariant checks. Hard constraints that must not be violated.
  • Inference models. Soft constraints that judge “does this seem right?”
  • Replay capability. Reproduce any session, any state.
  • Feedback loops. Human flags something wrong → harness learns.

The harness is the institutional knowledge of what “correct” means for your system. It’s executable understanding. It’s the thing that lets you move fast without breaking things—not because you’re careful, but because the harness is watching.


Part V: The Examples Compound Too

The harness compounds positively. But there’s another thing in your repo that compounds, and it’s the one you’re least watching: the canonical examples.

AI doesn’t generate code from first principles. It pattern-matches off the code already in your repo—especially your tests, since tests are usually the cleanest, most isolated, most labeled examples of “what good looks like” in your specific codebase. Every time the AI writes new code, it pulls shapes from existing tests. Every time it writes new tests, it pulls shapes from existing tests. The test suite isn’t just documentation. It’s the AI’s training data for your project.

So the test suite has its own compounding slope. And nothing in the harness watches it.

Positive direction: clean canonical examples seed clean next-round generation. The codebase keeps looking like itself because the AI keeps copying the parts that look right.

Negative direction: one dubious shape slips in. Next round, the AI pattern-matches off it. Now there are two. Next round, four. Every commit, you feel like you’re moving forward, but mostly you’re moving sideways while the surface of low-quality examples multiplies underneath you. The harness checks “did the right thing happen at runtime.” It does not check “is the source code shaped in a way that future generation can productively learn from.”

This is most acute when you’re designing the spec while implementing it. Greenfield languages. Internal DSLs. New paradigms with no established idiom. Anything where there is no external “look how it’s supposed to look.” In those projects, the in-repo examples ARE the spec, and the AI cannot reach outside them. If you let the example surface decay, you are decaying the spec.

The only forcing function that breaks the loop is making dubious shapes illegal. Push the rule into the compiler, the parser, the type checker, the linter—wherever it can refuse the bad shape and crash the stale example loudly. Stale examples can no longer silently rot. They get rewritten in idiomatic form, or they get deleted. The canonical surface gets cleaner. Next-round generation pattern-matches off cleaner examples. The slope flips.
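The smallest version of "make the dubious shape illegal" is a check that refuses to pass while a stale example exists. A sketch only: the two rules below are made-up examples of shapes a team might ban, and a real project would more likely express them as lint rules or type-level constraints than as regular expressions.

```typescript
// Hypothetical rule set: each rule names a shape that is no longer legal in this repo.
type ShapeRule = { id: string; pattern: RegExp; fix: string };
type ShapeViolation = { file: string; line: number; rule: string; fix: string };
type SourceFile = { path: string; source: string };

const RULES: ShapeRule[] = [
  { id: "no-fixed-sleep", pattern: /\bwaitForTimeout\(/, fix: "wait on an event instead" },
  { id: "no-exact-llm-text", pattern: /expect\(response\.text\)\.toBe\(/, fix: "assert on events or properties" },
];

// Scan every line of every file and report each banned shape with its location.
function findBannedShapes(files: SourceFile[], rules: ShapeRule[] = RULES): ShapeViolation[] {
  const violations: ShapeViolation[] = [];
  for (const file of files) {
    file.source.split("\n").forEach((text, index) => {
      for (const rule of rules) {
        if (rule.pattern.test(text)) {
          violations.push({ file: file.path, line: index + 1, rule: rule.id, fix: rule.fix });
        }
      }
    });
  }
  return violations;
}
```

Wired into CI so that any violation fails the build, the stale example has to be rewritten or deleted before the next commit can copy it.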

The harness watches behavior. You also have to watch the examples. If the AI is your primary author, you are shipping two artifacts every commit: the product, and the example surface that authors the next commit. Both compound. Only one is currently visible.


Part VI: The Transparency Feedback Loop

Here’s where it connects back to observability.

When you capture everything, you can see patterns. When you see patterns, you can encode them as constraints. When you encode constraints, you catch violations. When you catch violations, you learn what else to capture.

Observe → Pattern → Constraint → Violation → Observe more

This is the transparency feedback loop. Observability isn’t just about debugging. It’s about discovering what “correct” means for your system.

Every bug you find through observation becomes a constraint you can check. Every constraint you check reveals edge cases to observe. The harness grows organically from the act of watching.
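The Observe → Pattern step can start out very crude. One possible sketch: group error events by a normalized signature and surface the ones that keep recurring as candidates for a constraint. The normalization rules and the `minCount` threshold are arbitrary assumptions.

```typescript
type DevEvent = { _type: string; message?: string };
type Candidate = { signature: string; count: number };

// Collapse volatile details (numbers, quoted strings) so repeats of one failure group together.
function signatureOf(message: string): string {
  return message.replace(/\d+/g, "N").replace(/"[^"]*"/g, '"S"');
}

// Recurring error signatures are candidates for "this should never happen" checks.
function proposeConstraints(stream: DevEvent[], minCount = 2): Candidate[] {
  const counts = new Map<string, number>();
  for (const e of stream) {
    if (!e._type.endsWith(".error") || e.message === undefined) continue;
    const sig = signatureOf(e.message);
    counts.set(sig, (counts.get(sig) ?? 0) + 1);
  }
  return Array.from(counts)
    .filter(([, count]) => count >= minCount)
    .map(([signature, count]) => ({ signature, count }))
    .sort((a, b) => b.count - a.count);
}
```

The output is a proposal, not a constraint. A human or the AI still decides whether "Request N failed with N" deserves a hard check, which is the Pattern → Constraint step of the loop.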

The AI’s Role

The AI can participate in every step:

  • Observe: Query the event stream, notice anomalies
  • Pattern: “I’ve seen this sequence before, it usually means X”
  • Constraint: “Should I add a check for this invariant?”
  • Violation: “This run violated constraint Y, here’s why”
  • Learn: “Based on this session, I think we should also capture Z”

The AI isn’t just coding. The AI is building the harness alongside you. The harness is the shared artifact. The product is almost incidental—it’s what falls out of the harness.


Part VII: Implications

Testing Changes

Stop thinking about “test coverage.” Start thinking about “harness completeness.”

Coverage asks: “What percentage of code paths are exercised?” Completeness asks: “What percentage of ‘correct’ is encoded?”

You can have 100% coverage and still ship bugs, because the assertions checked the wrong things. You can have low coverage but a complete harness, because the inference catches what the assertions miss.

The metric isn’t lines covered. The metric is: “If something goes wrong, will the harness notice?”
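One way to approximate "will the harness notice?" is to borrow from fault injection: collect runs that are known to be broken and count how many the harness flags. This is an assumption about how the metric could be measured, not something the argument above depends on; `Notices` and `FaultyRun` are illustrative names.

```typescript
type DevEvent = { _type: string; [key: string]: unknown };
// A harness here is anything that returns true when it notices a problem in a run.
type Notices = (run: DevEvent[]) => boolean;
type FaultyRun = { fault: string; run: DevEvent[] };

// Fraction of known-bad runs the harness flags, plus the faults it missed.
function detectionRate(harness: Notices, faultyRuns: FaultyRun[]): { rate: number; missed: string[] } {
  const missed = faultyRuns.filter((f) => !harness(f.run)).map((f) => f.fault);
  const rate = faultyRuns.length === 0 ? 0 : (faultyRuns.length - missed.length) / faultyRuns.length;
  return { rate, missed };
}
```

The `missed` list is the useful part. Each entry is a concrete description of something that went wrong without the harness noticing, which is exactly the next constraint to add.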

Tooling Changes

The tools we need barely exist yet. We have:

  • Observability tools — built for production monitoring, not development iteration
  • Testing frameworks — built for boolean assertions, not inference
  • AI coding assistants — blind to runtime

We need:

  • Development firehoses — capture everything, five-minute retention, instant query
  • Inference harnesses — feed event streams to models, get judgments
  • Constraint builders — turn observations into checks automatically

These tools are primitive today. They’re emerging.

Culture Changes

“Write tests” becomes “build the harness.” “Debug the failure” becomes “query the stream.” “Manual QA” becomes “inference review.” “Ship and pray” becomes “ship with harness.”

The teams that internalize this will outship teams that don’t. Not because they’re smarter—because they’re building the right artifact.


Part VIII: The Sidetrack Example

This isn’t theory. We’re building it.

Sidetrack is a development observability tool. HTTP server, in-memory SQLite, five-minute retention. Captures console, network, errors, async transitions, DOM events. Queryable via curl.

curl 'http://localhost:6274/recent?_type=console.error'
curl 'http://localhost:6274/search?q=failed'

It’s primitive. It’s a lunch-break tool. It gives the AI eyes.

The input surface isn’t fixed. Anywhere you have a stream of dev-environment events worth asking questions about, you can pipe it in. A tail -F adapter against your local server’s test logs makes every log line an event with implicit structure (timestamp, level, source, message), indexable alongside your console errors and your network failures. Mainstream logging systems already do rule-based handling of log entries; you don’t need to replace any of that. You just need an adapter that funnels the relevant subset in.

The same shape applies to user feedback, bug reports, the thing your QA person flagged in Slack. In a sufficiently AI-paced team, the friction of “go check Jira, then Sentry, then the logs, then come back and translate” is itself the bottleneck: the AI can’t see those surfaces, so the human becomes a copy-paste relay. Moving those signals closer to the harness collapses the relay.
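The core of the tail adapter described above is a function that turns one log line into one event. A sketch, assuming a hypothetical line format of `<ISO timestamp> <LEVEL> [<source>] <message>`; real logs vary, and the `log.<level>` naming for `_type` is an invention for this example.

```typescript
type LogLineEvent = { _type: string; timestamp: string; level: string; source: string; message: string };

// Assumed line format (hypothetical): "<ISO timestamp> <LEVEL> [<source>] <message>"
const LINE = /^(\S+)\s+([A-Z]+)\s+\[([^\]]+)\]\s+(.*)$/;

function parseLogLine(line: string): LogLineEvent | null {
  const match = LINE.exec(line);
  if (match === null) return null; // unparseable lines are skipped, not guessed at
  const [, timestamp = "", level = "", source = "", message = ""] = match;
  return { _type: `log.${level.toLowerCase()}`, timestamp, level, source, message };
}
```

Everything else in the adapter is plumbing: read lines from `tail -F`, run each through `parseLogLine`, and forward the non-null results to wherever events are collected. How events get ingested is not specified here, so that half is left out.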

To be clear about what isn’t being claimed: the harness is not coming for your logging stack. It is not coming for your APM. It is not coming for the observability your architecture already gives you in production. Those tools solve a different problem: system health, error rates, where bottlenecks live. The harness asks a different question: “what just happened in dev so the AI can help me fix it?” Some of the same data is useful for that question. A tail adapter into Sidetrack doesn’t replace Datadog; it makes Datadog-shaped data legible to your dev loop.

But it’s also the seed of a harness. Once you have the event stream, you can:

  • Add sidetrack await — block until an event matches a query
  • Add constraint checks — “fail if any uncaught exception”
  • Add inference hooks — “does this session look correct?”
  • Add feedback capture — “human flagged this as wrong, learn from it”
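The first item in that list, blocking until an event matches a query, is mostly a polling loop. A sketch of what it might look like, not Sidetrack's implementation: `fetchRecent` stands in for whatever returns the recent events (for example, a call to the `/recent` endpoint), and the field-equality query format is an assumption.

```typescript
type DevEvent = { _type: string; [key: string]: unknown };

// A query matches when every field it names is strictly equal on the event.
function matchesQuery(e: DevEvent, query: Record<string, unknown>): boolean {
  return Object.keys(query).every((key) => e[key] === query[key]);
}

// Poll until some event matches, or throw once the timeout has passed.
async function awaitEvent(
  fetchRecent: () => Promise<DevEvent[]>,
  query: Record<string, unknown>,
  opts: { timeoutMs: number; intervalMs: number; sleep?: (ms: number) => Promise<void> },
): Promise<DevEvent> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const deadline = Date.now() + opts.timeoutMs;
  do {
    const hit = (await fetchRecent()).find((e) => matchesQuery(e, query));
    if (hit !== undefined) return hit;
    await sleep(opts.intervalMs);
  } while (Date.now() < deadline);
  throw new Error(`no event matched ${JSON.stringify(query)} within ${opts.timeoutMs}ms`);
}
```

This is what replaces fixed sleeps in a test: instead of waiting three seconds and hoping, the test waits for `{ _type: "order.created" }` and fails with a readable message if it never shows up.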

The observability tool becomes the test harness becomes the inference engine. The lines blur because they were always the same thing: encoding what “correct” means.


Conclusion

This might sound abstract. “Harness-first development.” “Inference over assertion.” “The product is downstream.”

But the shift is already happening. AI-assisted development is exposing the limits of boolean testing. Complex UIs are exposing the limits of Playwright. Non-deterministic outputs are exposing the limits of expected-value assertions.

The response is to build harnesses. Encode correctness. Let products be projections of what the harness validates.

The harness is the product. Build accordingly.


This essay is part of a series on AI-assisted development. See also: “Future Work Mythology” (what’s dying), “The Toolmaker’s Discipline” (what to build), and “The LLM Whisperer” (how to collaborate).