~capture vs foldl': When Pure Abstractions Match Raw Performance

8 min read

We had a hypothesis: Koru’s zero-cost abstractions really are zero-cost. Not “almost zero.” Not “negligible overhead.” Actually zero.

So we set up a benchmark against the gold standard of functional accumulation: Haskell’s foldl'.


The Benchmark: Sum 100 Million Numbers

The simplest possible fold - sum the numbers 1 to 100,000,000. If there’s any overhead in our abstraction, it’ll show up here: at a hundred million iterations, even one nanosecond of per-iteration overhead adds a visible tenth of a second.

Haskell with foldl’

import Data.List (foldl')

n :: Int
n = 100000000

main :: IO ()
main = putStrLn $ "Haskell foldl' sum: " ++ show (foldl' (+) 0 [1..n])

This is the strict left fold. Not lazy foldl (which would blow the stack), but foldl' - the one you’re supposed to use. Compiled with -O2.
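
If you want the distinction concrete, here is a minimal sketch (standard Prelude and Data.List, nothing Koru-specific) of why the strict variant matters:

-- foldl defers every addition, building a chain of thunks:
--   foldl (+) 0 [1,2,3]  =  ((0 + 1) + 2) + 3, unevaluated until demanded
-- foldl' forces the accumulator at each step, so it runs in constant space.
import Data.List (foldl')

lazySum, strictSum :: Int
lazySum   = foldl  (+) 0 [1 .. 100000000]  -- thunk buildup; can exhaust the stack without optimization
strictSum = foldl' (+) 0 [1 .. 100000000]  -- strict accumulation, constant space

main :: IO ()
main = print strictSum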

Koru with ~capture + ~for

~capture({ sum: @as(i64, 0) })
| as acc |> for(0..N)
    | each i |> captured { sum: acc.sum + @as(i64, @intCast(i)) + 1 }
| captured result |> std.io:print.ln("${result.sum:d}")

This is the clean syntax. ~capture initializes the state record, ~for iterates, captured updates the accumulator on each iteration, and | captured result |> receives the final state.
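
To map this onto the Haskell above: the ~capture record plays the role of the fold’s seed, each captured step is the combining function, and the final | captured result |> is the fold’s return value. A rough semantic analogue - our reading of the pipeline, not generated code:

import Data.List (foldl')

n :: Int
n = 100000000

-- acc starts at 0; each iteration i contributes i + 1, mirroring the Koru body.
step :: Int -> Int -> Int
step acc i = acc + i + 1

main :: IO ()
main = print (foldl' step 0 [0 .. n - 1])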

Hand-Written Zig (Control)

const std = @import("std");

const N: usize = 100_000_000;

pub fn main() void {
    var sum: i64 = 0;
    for (0..N) |i| {
        sum += @as(i64, @intCast(i)) + 1;
    }
    std.debug.print("{d}\n", .{sum});
}

This is what we want to match. The simplest possible loop, no abstraction whatsoever.


The Results

Running each version 5 times, taking the median:

Version             Time
-----------------   -------
Haskell foldl'      74.1 ms
Koru (runtime N)    17.0 ms
Hand-written Zig    17.3 ms

Let that sink in.

Koru matches hand-written Zig within measurement noise. The event-driven, pure-looking abstraction compiles down to the same code as a mutable loop written by hand.

Koru is 4.4x faster than Haskell. And this is Haskell with its best foot forward - strict fold, -O2 optimization, a problem that should be perfect for GHC to optimize.


The Comptime Surprise

We originally had N as a compile-time constant:

const N: usize = 100_000_000;

~capture({ sum: @as(i64, 0) })
| as acc |> for(0..N)
    ...

The first run completed in 1.3 ms. We thought we’d made a mistake.

We hadn’t. When the loop bound is a compile-time constant and the body is pure, the optimizer evaluates the entire loop during compilation. The binary doesn’t contain a loop at all - it contains the constant 5,000,000,050,000,000 (the closed form n(n+1)/2 with n = 100,000,000).

This is why we created a runtime N version for fair comparison (reading N from command-line arguments). But the comptime folding is a feature, not a bug. When your bounds ARE known at compile time, Zig will compute the answer for you.


What’s Actually Generated

Here’s what Koru’s ~capture + ~for compiles to:

const __CaptureT = comptime blk: {
    // Type magic: convert comptime fields to runtime fields
    const info = @typeInfo(@TypeOf(.{ .sum = @as(i64, 0) }));
    var fields: [info.@"struct".fields.len]std.builtin.Type.StructField = undefined;
    for (info.@"struct".fields, 0..) |f, i| {
        fields[i] = .{
            .name = f.name,
            .type = f.type,
            .default_value_ptr = null,
            .is_comptime = false,
            .alignment = f.alignment,
        };
    }
    break :blk @Type(.{ .@"struct" = .{
        .layout = .auto,
        .fields = &fields,
        .decls = &.{},
        .is_tuple = false,
    }});
};

var acc: __CaptureT = .{ .sum = @as(i64, 0) };
for (0..N) |i| {
    acc = .{ .sum = acc.sum + @as(i64, @intCast(i)) + 1 };
}
const result = acc;
std.debug.print("{d}\n", .{result.sum});

That __CaptureT block is pure comptime metaprogramming - it takes the type of the anonymous literal (whose fields are comptime-known) and rebuilds it as a struct whose fields are ordinary runtime fields. At runtime, the code is just var acc, a loop, and assignments. Exactly what you’d write by hand.


Why ~capture is More General Than foldl

Here’s the insight that makes this design special.

Haskell’s foldl' is beautiful, but it’s ONE pattern:

foldl' :: (b -> a -> b) -> b -> [a] -> b

You get an accumulator, a combining function, and a list. The fold runs to completion.

Koru’s ~capture is MORE GENERAL because it decouples state from iteration:

~capture({ sum: @as(i64, 0), count: @as(i32, 0) })
| as acc |> for(items)
    | each item |> if(item > threshold)
        | then |> captured { sum: acc.sum + item, count: acc.count + 1 }
        | else |> captured { sum: acc.sum, count: acc.count }
| captured result |> compute_average(result.sum, result.count)

The capture flows through:

  • Any iteration (~for, ~while, manual recursion)
  • Any conditionals (~if, pattern matching)
  • Any nesting (captures inside captures, each addressed by binding name)

You learn ONE concept and it works everywhere. Not “foldl for lists, scanl for intermediate values, mapAccumL for paired outputs” - just ~capture.
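
For contrast, here is that Haskell family side by side, plus the conditional capture from above expressed the way Haskell would need it - sumAndCount is our stand-in name, and threshold mirrors the Koru snippet:

-- The fold family, list-specialized:
--   foldl'    :: (b -> a -> b) -> b -> [a] -> b                (final accumulator only)
--   scanl     :: (b -> a -> b) -> b -> [a] -> [b]              (every intermediate value)
--   mapAccumL :: (s -> a -> (s, b)) -> s -> [a] -> (s, [b])    (state plus per-element outputs)
--
-- The conditional capture collapses into one foldl' threading a pair by hand:
import Data.List (foldl')

sumAndCount :: Double -> [Double] -> (Double, Int)
sumAndCount threshold = foldl' step (0, 0)
  where
    step (s, c) item
      | item > threshold = (s + item, c + 1)
      | otherwise        = (s, c)

main :: IO ()
main = print (sumAndCount 2.5 [1.0, 3.0, 4.5])  -- prints (7.5,2): only 3.0 and 4.5 clear the threshold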


The Bigger Picture

This benchmark demonstrates what we’ve been saying: Koru’s abstractions compile away completely.

When you write:

~capture({ sum: 0 })
| as acc |> for(items)
    | each i |> captured { sum: acc.sum + i }
| captured result |> use(result.sum)

You’re not paying for:

  • Event dispatch overhead (there are no events at runtime)
  • Continuation allocation (continuations are compile-time routing)
  • Abstraction layers (the abstraction IS the generated code)

You’re getting the exact same binary as if you’d written a mutable loop by hand. But the SOURCE is pure dataflow - no mutation visible, state flows forward through continuations.

This is what zero-cost means. Not “low cost.” Not “acceptable cost.” The same assembly instructions you’d get writing C.


A Note on Fairness

The 4.4x difference is real, but let’s be honest about what we’re comparing.

The Haskell code is idiomatic - the way most Haskell programmers would write it. foldl' (+) 0 [1..n] is the “correct” strict fold. But it’s not maximally optimized Haskell.

The overhead here comes from:

  • List allocation: [1..n] creates a lazy list that gets consumed (even with fusion rules, there’s overhead)
  • Function call overhead: Each iteration calls the combining function
  • GC pressure: The list spine needs to be collected

With more aggressive optimization (manual loops, unboxed types, LLVM backend), Haskell can get closer. For complex accumulators with strict data types and bang patterns, Haskell gets within 1.5-3x of native code. See our follow-up benchmark for the full analysis.
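
To make “optimization dialect” concrete, this is roughly the shape of hand-tuned Haskell that closes the gap - a sketch for illustration, not the code from the follow-up benchmark:

{-# LANGUAGE BangPatterns #-}

-- A manual strict loop: no list to allocate, no closure handed to a combinator.
-- The bang pattern keeps the accumulator fully evaluated on every iteration,
-- so GHC can compile this to a tight loop over unboxed Ints.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 100000000)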

The point isn’t “Haskell is slow.” GHC is an engineering marvel. The point is: in Koru, the obvious code IS the fast code. You don’t need to learn an “optimization dialect” - strictness annotations, UNPACK pragmas, manual unboxing. The first thing you write compiles to optimal code.