The Producer Doesn't Need To Know

The benchmark results that started this conversation:

Koru Taps:        8.5ms
Zig MPMC Ring:    88ms
Go Channels:      681ms
Rust Crossbeam:   157ms

Event taps are 90% faster than a hand-tuned lock-free MPMC ring.

Before you tweet that: this is not an apples-to-apples comparison. And that’s exactly the point.


The Honest Disclaimer

Let’s be completely transparent about what these benchmarks actually measure:

Taps Version            Ring Version
------------            ------------
Single-threaded         Multi-threaded
No synchronization      Lock-free atomics
Inline function calls   Cross-thread communication
~80 lines of Koru       ~230 lines of Koru+Zig

The taps version doesn’t DO what the ring version does. It’s like saying “a bicycle is faster than a truck” - true, but you can’t haul cargo on a bicycle.

So why are we even comparing them?


They Solve The Same Pattern

Both implementations solve the Producer/Consumer Observation Pattern:

  1. Something happens (producer emits values)
  2. Something else reacts (consumer accumulates them)
  3. At the end, we validate the result

The ring version assumes you need:

  • Thread separation
  • Buffering for backpressure
  • Lock-free synchronization
  • Cross-thread communication

But what if you don’t? What if your observer can run inline? What if you just want to watch what’s happening without introducing all that machinery?

The insight: We’ve been conflating “observation” with “transport” for decades.


The Revolutionary Inversion

Here’s how traditional concurrent programming works:

Producer decides → "I'll put values on a channel"
Consumer must adapt → "Okay, I'll read from that channel"
Want to change transport? → Rewrite the producer!

This couples your producer to a specific communication mechanism. Every producer has to decide: channels? rings? callbacks? message queues?

Koru taps invert this entirely:

Producer just emits → ~count() returns .next or .done
Observer decides → "I'll tap that inline / via threadpool / via channel"
Want to change transport? → Change the tap, not the producer!

Here’s what this looks like in practice:

// The producer - IDENTICAL in all cases
~event count { i: u64 }
| next { value: u64 }
| done {}

~proc count {
    if (i >= MESSAGES) return .{ .done = .{} };
    return .{ .next = .{ .value = i } };
}

// Observer Option A: Inline (what we benchmarked - zero overhead)
~count -> * | next v |> accumulate(value: v.value)

// Observer Option B: Threadpool (async processing)
~count -> * | next v |> pool.submit(work: v)

// Observer Option C: Channel (buffered, decoupled)
~count -> * | next v |> ring.enqueue(value: v.value)

The producer code is identical in all three cases. The observer makes the transport decision.
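The same inversion can be sketched outside Koru. Here is a hedged Rust analogy (the count function, the closures, and the channel wiring are all invented for illustration, not Koru's actual semantics): the producer only ever calls an observer callback, and the caller decides whether that callback accumulates inline or feeds a channel.

```rust
// Hypothetical sketch of the inversion: the producer never picks a transport,
// it just invokes whatever observer it was handed.
fn count(messages: u64, mut observe: impl FnMut(u64)) {
    for i in 0..messages {
        observe(i); // the "next" branch
    }
    // the "done" branch: the producer simply returns
}

fn main() {
    // Option A: inline observer - no synchronization at all
    let mut sum = 0u64;
    count(10, |v| sum += v);
    assert_eq!(sum, 45);

    // Option C: channel-backed observer - the producer code is unchanged
    let (tx, rx) = std::sync::mpsc::channel();
    count(10, move |v| tx.send(v).unwrap());
    // the sender is dropped when count returns, so this sum terminates
    let sum2: u64 = rx.iter().sum();
    assert_eq!(sum2, 45);
}
```

In both calls the producer's body is byte-for-byte identical; only the closure handed to it changes.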


Back To The Benchmark

Now the benchmark makes sense. We compared:

  1. Taps version: Observer says “I’ll just accumulate inline”
  2. Ring version: Observer says “I need cross-thread buffering”

The taps version is faster because it’s doing less work - and that’s exactly right. When you don’t need threading, you shouldn’t pay for it.

The benchmark isn’t proving taps are “better than channels.” It’s demonstrating what happens when the observer can choose the minimum viable transport.


Observation Fidelity: Choose Your Level

Taps don’t just let you choose transport. They let you choose how much information you want to observe:

// TRANSITION: Just the facts (source, destination, branch as enums)
// Zero string allocations, maximum performance
~count -> * | Transition t |> stats.increment(event: t.source)

// PROFILE: Timing information (strings, timestamps)
// For profiling and tracing
~count -> * | Profile p |> trace.record(name: p.source, ts: p.timestamp_ns)

// AUDIT: Full payload access (complete event data)
// For logging, debugging, event sourcing
~count -> * | Audit a |> log.write(event: a.source, data: a.payload)

This creates a 2D matrix of observation strategies:

             Inline            Threadpool       Channel
Transition   Counters          Async metrics    Buffered stats
Profile      Inline profiler   Async tracing    Log aggregation
Audit        Debug logging     Async audit      Full event sourcing

The producer is unaware of ALL of this.
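One hedged way to picture the three fidelity levels, modeled as plain Rust types (the names Transition/Profile/Audit mirror the article; the fields and the counter loop are invented):

```rust
use std::collections::HashMap;

// Hypothetical modeling of the three observation levels as data shapes.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
#[allow(dead_code)]
enum Source { Count, Validate }

struct Transition { source: Source }                  // enums only: no allocation
#[allow(dead_code)]
struct Profile { source: Source, timestamp_ns: u128 } // adds timing data
#[allow(dead_code)]
struct Audit { source: Source, payload: Vec<u8> }     // full event payload

fn main() {
    // A Transition-level observer: cheap per-event counters, nothing more
    let mut stats: HashMap<Source, u64> = HashMap::new();
    for _ in 0..3 {
        let t = Transition { source: Source::Count };
        *stats.entry(t.source).or_insert(0) += 1;
    }
    assert_eq!(stats[&Source::Count], 3);
}
```

Each level strictly adds information over the previous one, which is why the cheap Transition level can stay allocation-free.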


Real Example: Full Profiler in 3 Taps

Here’s profiler.kz from the Koru standard library - a complete Chrome Tracing profiler:

// Start profiling when program starts
~[opaque]tap(koru:start -> *)
| done |> write_header()
    | done |> _

// Profile EVERY event transition in the entire program
~[opaque]tap(* -> *)
| Profile p |> write_event(source: p.source, timestamp_ns: p.timestamp_ns)
    | done |> _

// End profiling when program ends
~[opaque]tap(koru:end -> *)
| done |> write_footer()
    | done |> _

Three taps. That’s the whole profiler.

The * -> * pattern means “tap ALL event transitions.” Every event in your entire program fires through this tap.

Usage:

~import "$std/profiler"
// Your entire app is now profiled for Chrome Tracing

The application code doesn’t know it’s being profiled. The profiler just observes.


Opting Out: [opaque]

Notice the [opaque] annotation on the profiler taps. This serves two purposes:

  1. Prevents infinite recursion - The profiler’s own events (write_header, write_event, write_footer) can’t be tapped by other observers, including itself.

  2. Privacy control - Any event can opt out of being observed:

// This event cannot be tapped by external observers
~[opaque] event internal_crypto_operation { key: []u8 }

Use cases:

  • Security-sensitive operations
  • Performance-critical hot paths
  • Internal implementation details
  • Preventing observation loops

Why Taps Are Truly Zero-Cost: AST Rewriting

This isn’t some runtime hook system. Taps rewrite your AST at compile time.

When you write:

~count -> * | next v |> accumulate(value: v.value)

The compiler’s tap transformer pass inserts the tap invocation directly into the flow’s AST:

BEFORE tap injection:
~count() | next n |> @loop(...)

AFTER tap injection:
~count() | next n |> accumulate(value: n.value) |> @loop(...)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     COMPILER INSERTED THIS

This means taps:

  • Participate in all optimization passes (purity checking, fusion, dead code elimination)
  • Are visible to the type checker (arguments are validated)
  • Can be inlined (the optimizer sees them as regular code)
  • Are eliminated when unused (dead code elimination works normally)

Compare this to traditional observer patterns where callbacks are opaque function pointers the compiler can’t reason about.
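The contrast can be sketched in Rust (an analogy, not Koru's codegen): a generic observer is monomorphized so the optimizer sees its body, while a dyn function pointer is a black box behind an indirect call.

```rust
// Compiler-visible observer: each call site is monomorphized, so the
// observer body can be inlined into the loop, like an injected tap.
fn drive_static(mut observe: impl FnMut(u64)) {
    for i in 0..1000u64 {
        observe(i);
    }
}

// Opaque observer: an indirect call the optimizer cannot see through,
// like a traditional registered callback.
fn drive_dynamic(observe: &mut dyn FnMut(u64)) {
    for i in 0..1000u64 {
        observe(i);
    }
}

fn main() {
    let mut a = 0u64;
    drive_static(|v| a += v);

    let mut b = 0u64;
    drive_dynamic(&mut |v| b += v);

    // Same result; only the compiler's visibility into the observer differs
    assert_eq!(a, b);
    assert_eq!(a, 499_500);
}
```

Taps are the first case: because the observer is spliced into the AST, every downstream pass treats it as ordinary code.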


The Generated Code

Here’s what our tap benchmark actually compiled to (from output_emitted.zig):

var loop_i: u64 = 0;
var result_1 = main_module.count_event.handler(.{ .i = loop_i });
loop: while (result_1 == .next) {
    const n = result_1.next;
    // TAP INLINED HERE - just a function call!
    const result_2 = main_module.accumulate_event.handler(.{ .value = n.value });
    _ = &result_2;
    loop_i = n.value + 1;
    result_1 = main_module.count_event.handler(.{ .i = loop_i });
    continue :loop;
}
// TAP FOR DONE BRANCH
_ = main_module.validate_event.handler(.{});

No vtables. No dispatch. No runtime registration. Just a function call that the Zig optimizer can inline further.


Greppability and Tooling

Every tap in your codebase is discoverable:

# Find all taps
grep '~.*->' **/*.kz

# Find all opaque events
grep '\[opaque\]' **/*.kz

# Find all taps on a specific event
grep '~count ->' **/*.kz

Because taps are AST nodes, tooling can:

  • Show “who’s watching this event?”
  • Warn about unused taps
  • Verify tap chains for correctness
  • Visualize observation topology

The Bigger Picture

The benchmark wasn’t about proving taps are “faster than channels.” It was about demonstrating a different way of thinking about producer/consumer relationships.

Traditional: Producer picks the transport. Consumer adapts.
Koru: Producer emits events. Observer picks everything.

When you separate observation from transport:

  • You can start with inline taps (maximum performance)
  • Add threading only when profiling shows you need it
  • The producer code never changes
  • Observation strategies are greppable and explicit

What This Enables

  1. Progressive Optimization: Start inline, add threading only for proven hot paths
  2. Flexible Profiling: Profile everything with ~* -> *, cost nothing in release
  3. Clean Architecture: Producers focus on logic, observers handle cross-cutting concerns
  4. Security Boundaries: [opaque] for events that shouldn’t be observable

Conclusion

The headline “taps are 90% faster than rings” is technically true but misses the point.

The real story is: the producer doesn’t need to know.

It doesn’t need to know if you’re observing. It doesn’t need to know how you’re observing. It doesn’t need to know what transport you’re using. It doesn’t need to know anything about your observation strategy.

That’s not just faster. That’s a fundamentally different way to think about concurrent systems.


Published November 21, 2025

Koru: Where the observer decides, and the producer just produces.