The Harness Is The Product
The traditional mental model:
Product → Tests (validate the product)

The inversion that’s actually happening:

Harness → Product (the product is a projection of the harness)

The harness isn’t a safety net. The harness is the core product. The shipped artifact is a snapshot.
This essay is about why that inversion matters, and what it means for how we build software.
Part I: The Blindness Problem
AI coding assistants are blind to runtime.
They can read your code. They can reason about structure. They can suggest changes, refactor functions, add features. But they cannot see:
- What’s logging to the console right now
- What network requests are failing
- What errors are being thrown
- What the user is actually clicking on
Debugging becomes a game of copy-paste telephone. “Can you paste the error message?” “What does the console show?” “What’s the network tab say?”
The AI is powerful but blind. You are sighted but slow. The bottleneck isn’t intelligence—it’s information transfer.
The Obvious Solution
Capture everything. Make runtime visible. Give the AI eyes.
This is what observability tools do. Logging. Tracing. Metrics. APM dashboards. The enterprise has been doing this for decades.
But enterprise observability is built for production monitoring. It’s designed to answer: “Is the system healthy? Where are the bottlenecks? What’s the error rate?”
That’s not the question we need to answer.
The question is: “What just happened in my development environment so the AI can help me fix it?”
Different question. Different tool.
The Five-Minute Firehose
Here’s what actually works: capture everything, keep it for five minutes, make it queryable.
No schema. No filtering at capture time. No decisions about what matters. Just swallow the firehose and let the query layer figure it out.
```bash
# AI runs this and instantly sees your app's state
curl http://localhost:6274/recent?_type=console.error
```

The AI can now see what you see. The information transfer bottleneck dissolves. You describe the problem, the AI queries the state, you iterate together.
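To make the firehose concrete, here is a minimal sketch of the capture side: an append-only in-memory buffer with time-based eviction, deferring all filtering to query time. The names are illustrative, not Sidetrack's actual internals.

```typescript
// Firehose sketch: capture everything, keep five minutes, filter at query time.
type Event = { ts: number; _type: string; payload: unknown };

const RETENTION_MS = 5 * 60 * 1000;
const buffer: Event[] = [];

// Capture is a dumb append. No schema, no decisions about what matters.
function capture(_type: string, payload: unknown): void {
  buffer.push({ ts: Date.now(), _type, payload });
}

// The buffer is append-ordered, so eviction drops from the front
// until everything left is inside the five-minute window.
function evict(): void {
  const cutoff = Date.now() - RETENTION_MS;
  while (buffer.length > 0 && buffer[0].ts < cutoff) buffer.shift();
}

// The query layer is where intelligence lives: filter by type on demand.
function recent(_type?: string): Event[] {
  evict();
  return _type ? buffer.filter((e) => e._type === _type) : [...buffer];
}
```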
This is table stakes. This is the minimum viable observability for AI-assisted development.
But it’s not the interesting part.
Part II: The Playwright Problem
Once you have observability, you start thinking about testing.
The standard answer is Playwright. Browser automation. Click buttons, fill forms, assert on DOM state.
```typescript
await page.click('[data-testid="submit-button"]')
await expect(page.locator('.success-message')).toBeVisible()
```

This works for CRUD apps with simple interactions. Click button, see message. Fill form, submit, check result.
It doesn’t work for:
Complex spatial UIs. Graph editors. Node-based workflows. Draggable windows. Floating panels. The DOM structure is dynamic. Coordinates aren’t stable. “Click the third node” doesn’t mean anything when nodes can be anywhere.
Non-deterministic outputs. LLM in the loop. The response varies every time. You can’t assert “this exact text appears” because the text is different on every run.
Behavior over appearance. You don’t actually care if the success message has the right CSS class. You care that the order was created in the backend. Playwright tests the surface, not the behavior.
Much of the most interesting software being built today doesn’t fit the Playwright model. And the more AI-assisted development accelerates, the more we build things that don’t fit.
What Can You Actually Assert?
When outputs are non-deterministic and UIs are complex, boolean assertions break down.
assert(result === expected) — but what’s “expected” when the output is probabilistic?
The answer: you can’t assert on values. You assert on events.
- “An LLM response event arrived”
- “The response was rendered to the canvas”
- “No error events fired”
- “The state update contained the required fields”
You’re not testing “did this exact thing happen.” You’re testing “did the system behave correctly at the seam level.”
The seams are stable even when the outputs aren’t. The event flow is deterministic even when the content isn’t.
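As a sketch of what seam-level assertions can look like, here is one way to check the four events above. The event shape and event-type names are hypothetical, not from any particular framework:

```typescript
// Hypothetical event-stream assertions: we check the seams (which events
// fired, in what order, with what shape), never the probabilistic content.
type Event = { _type: string; payload: Record<string, unknown> };

function assertRun(events: Event[]): void {
  const types = events.map((e) => e._type);

  // "An LLM response event arrived"
  if (!types.includes("llm.response")) throw new Error("no llm.response event");

  // "The response was rendered to the canvas" (after it arrived;
  // indexOf returns -1 if the render event is missing entirely)
  if (types.indexOf("canvas.render") < types.indexOf("llm.response"))
    throw new Error("response was not rendered after arrival");

  // "No error events fired"
  if (types.some((t) => t.startsWith("error"))) throw new Error("error event fired");

  // "The state update contained the required fields"
  const update = events.find((e) => e._type === "state.update");
  if (!update || !("orderId" in update.payload))
    throw new Error("state.update missing required fields");
}
```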
Part III: From Boolean to Inference
Here’s the shift that matters:
Traditional testing: assert(x === y) — true or false, pass or fail
What we actually need: “Given everything that happened, does this look right?”
The first is a boolean. The second is inference.
When you have an LLM in the loop, you can’t write:
```typescript
expect(response.text).toBe("Hello, world!")
```

But you CAN write:

```typescript
expect(response).toSatisfy(r =>
  hasNoErrors(r) &&
  containsGreeting(r) &&
  responseTimeUnder(r, 2000)
)
```

Or better—you feed the entire event stream to another model and ask: “Does this run look correct given the intent?”
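A sketch of what that delegation might look like. The prompt shape is an assumption, and `callModel` is a placeholder for whatever LLM client you actually use:

```typescript
// Inference-based "test": hand the whole event stream plus the stated
// intent to a judge model and get a verdict instead of a boolean.
type Verdict = { correct: boolean; reasoning: string };

// Placeholder for your actual LLM client (assumption, not a real API).
async function callModel(_prompt: string): Promise<string> {
  throw new Error("wire up your LLM client here");
}

async function judgeRun(intent: string, events: unknown[]): Promise<Verdict> {
  const prompt = [
    `Intent: ${intent}`,
    `Event stream: ${JSON.stringify(events, null, 2)}`,
    `Given the intent, does this run look correct?`,
    `Reply as JSON: {"correct": boolean, "reasoning": string}`,
  ].join("\n\n");
  return JSON.parse(await callModel(prompt)) as Verdict;
}
```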
The “test” isn’t code anymore. The test is judgment. And judgment can be delegated to inference.
The Observability-Testing Continuum
This reveals something interesting. Observability and testing aren’t separate activities. They’re points on a continuum:
Pure Observation — “What happened?” Passive, after-the-fact. Human judges correctness.
Guided Observation — “What happened, filtered by what I care about?” Still passive, but focused.
Replay — “What happened, and can we do it again?” Capture inputs, replay deterministically.
Constraint Checking — “Did anything violate known invariants?” Automated but still pattern-matching.
Inference — “Does this run smell correct given the intent?” Judgment, not assertion.
Traditional testing lives at “Constraint Checking.” Playwright lives there. Unit tests live there.
But AI-assisted development needs to operate further right on the spectrum. When outputs are probabilistic, when UIs are spatial, when behavior matters more than appearance—you need inference, not assertion.
And inference needs data. The more you observe, the better your inference. The firehose feeds the judgment.
Part IV: The Harness Inversion
Now we can see the inversion clearly.
The old model: You build a product. Then you write tests. The tests validate the product. If tests pass, the product is correct.
The new model: You build a harness. The harness captures everything. The harness applies constraints (hard and soft). The harness feeds inference. The product is whatever survives the harness.
The harness isn’t downstream from the product. The product is downstream from the harness.
This matters because:
1. The harness compounds. Every constraint you add makes future development faster. Every pattern the inference learns makes judgment better. The harness gets smarter over time. The product is rebuilt constantly, but the harness accumulates.
2. The harness is the competitive advantage. Two teams with equal talent—the one with the better harness ships faster, with fewer bugs, with more confidence. The harness is the force multiplier.
3. The harness enables AI velocity. The AI can iterate fearlessly when the harness catches mistakes. Without a harness, every AI suggestion needs human verification. With a harness, the human verifies the constraints, and the AI operates freely within them.
What Lives in the Harness?
- Observability capture. Everything that happens, queryable.
- Invariant checks. Hard constraints that must not be violated.
- Inference models. Soft constraints that judge “does this seem right?”
- Replay capability. Reproduce any session, any state.
- Feedback loops. Human flags something wrong → harness learns.
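One way to picture how these pieces fit together, as a sketch of a harness interface. Every name here is invented for illustration:

```typescript
// Illustrative shape of a harness: capture feeds everything else.
interface Harness {
  // Observability capture: everything that happens, queryable.
  capture(event: { _type: string; payload: unknown }): void;
  query(filter: { _type?: string; since?: number }): unknown[];

  // Invariant checks: hard constraints that must not be violated.
  addInvariant(name: string, check: (events: unknown[]) => boolean): void;

  // Inference: soft constraints that judge "does this seem right?"
  judge(intent: string): Promise<{ correct: boolean; reasoning: string }>;

  // Replay: reproduce any session from its captured inputs.
  replay(sessionId: string): Promise<void>;

  // Feedback loop: a human flags a run as wrong, the harness learns.
  flag(sessionId: string, note: string): void;
}
```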
The harness is the institutional knowledge of what “correct” means for your system. It’s executable understanding. It’s the thing that lets you move fast without breaking things—not because you’re careful, but because the harness is watching.
Part V: The Transparency Feedback Loop
Here’s where it connects back to observability.
When you capture everything, you can see patterns. When you see patterns, you can encode them as constraints. When you encode constraints, you catch violations. When you catch violations, you learn what else to capture.
Observe → Pattern → Constraint → Violation → Observe more

This is the transparency feedback loop. Observability isn’t just about debugging. It’s about discovering what “correct” means for your system.
Every bug you find through observation becomes a constraint you can check. Every constraint you check reveals edge cases to observe. The harness grows organically from the act of watching.
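For example (an invented scenario): you notice in the stream that every failed checkout is preceded by an API request with a missing auth header. That observation becomes a constraint you check on every run:

```typescript
// Hypothetical constraint distilled from observation: failed checkouts
// correlated with unauthenticated requests, so we encode the invariant.
type NetEvent = {
  _type: string;
  payload: { url: string; headers: Record<string, string> };
};

function noUnauthenticatedApiCalls(events: NetEvent[]): boolean {
  return events
    .filter((e) => e._type === "network.request" && e.payload.url.includes("/api/"))
    .every((e) => "authorization" in e.payload.headers);
}
```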
The AI’s Role
The AI can participate in every step:
- Observe: Query the event stream, notice anomalies
- Pattern: “I’ve seen this sequence before, it usually means X”
- Constraint: “Should I add a check for this invariant?”
- Violation: “This run violated constraint Y, here’s why”
- Learn: “Based on this session, I think we should also capture Z”
The AI isn’t just coding. The AI is building the harness alongside you. The harness is the shared artifact. The product is almost incidental—it’s what falls out of the harness.
Part VI: Implications
Testing Changes
Stop thinking about “test coverage.” Start thinking about “harness completeness.”
Coverage asks: “What percentage of code paths are exercised?” Completeness asks: “What percentage of ‘correct’ is encoded?”
You can have 100% coverage and still ship bugs—the assertions were wrong. You can have low coverage but a complete harness—the inference catches what the assertions miss.
The metric isn’t lines covered. The metric is: “If something goes wrong, will the harness notice?”
Tooling Changes
The tools we need don’t exist yet. We have:
- Observability tools — built for production monitoring, not development iteration
- Testing frameworks — built for boolean assertions, not inference
- AI coding assistants — blind to runtime
We need:
- Development firehoses — capture everything, five-minute retention, instant query
- Inference harnesses — feed event streams to models, get judgments
- Constraint builders — turn observations into checks automatically
These tools are primitive today. They’re emerging.
Culture Changes
“Write tests” becomes “build the harness.” “Debug the failure” becomes “query the stream.” “Manual QA” becomes “inference review.” “Ship and pray” becomes “ship with harness.”
The teams that internalize this will outship teams that don’t. Not because they’re smarter—because they’re building the right artifact.
Part VII: The Sidetrack Example
This isn’t theory. We’re building it.
Sidetrack is a development observability tool. HTTP server, in-memory SQLite, five-minute retention. Captures console, network, errors, async transitions, DOM events. Queryable via curl.
```bash
curl http://localhost:6274/recent?_type=console.error
curl http://localhost:6274/search?q=failed
```

It’s primitive. It’s a lunchbreak tool. It gives the AI eyes.
But it’s also the seed of a harness. Once you have the event stream, you can:
- Add sidetrack await — block until an event matches a query (sketched below)
- Add constraint checks — “fail if any uncaught exception”
- Add inference hooks — “does this session look correct?”
- Add feedback capture — “human flagged this as wrong, learn from it”
The observability tool becomes the test harness becomes the inference engine. The lines blur because they were always the same thing: encoding what “correct” means.
Conclusion
This might sound abstract. “Harness-first development.” “Inference over assertion.” “The product is downstream.”
But the shift is already happening. AI-assisted development is exposing the limits of boolean testing. Complex UIs are exposing the limits of Playwright. Non-deterministic outputs are exposing the limits of expected-value assertions.
The response is to build harnesses. Encode correctness. Let products be projections of what the harness validates.
The harness is the product. Build accordingly.
This essay is part of a series on AI-assisted development. See also: “Future Work Mythology” (what’s dying), “The Toolmaker’s Discipline” (what to build), and “The LLM Whisperer” (how to collaborate).