Idiomatic Koru Kernels Match Hand-Specialized C



We wanted Koru kernels to land in the same ballpark as idiomatic C, Rust, and Zig.

The result was stronger than that.

Our fused n-body kernel, written in straightforward Koru kernel style, came in faster than the plain reference implementations. Every implementation here is “naive” — the obvious, idiomatic version a competent programmer would write in each language. No tricks, no hand-tuning, no -ffast-math:

Implementation          Relative
Koru (kernel fused)     1.00
C                       1.14
Zig                     1.14
Rust                    1.17
SBCL (kernel DSL)       1.88
SBCL (Common Lisp)      2.03
GHC (Haskell)           2.58

50 million iterations. Three warmup runs. Multiple timed runs via hyperfine. Same machine. Same benchmark harness. All source code is available on GitHub.

A note on numbers: these are ballpark figures. Exact ratios shift slightly between benchmark runs depending on system load, thermal state, and scheduling. The relative ordering is stable — we’ve run this many times — but don’t read too much into the second decimal place.

That was already interesting. But it raised the obvious question: is Koru really producing unusually strong code here, or is the reference C just not shaped the way the optimizer wants?

So we followed up the honest way.

The Follow-Up

We wrote a fixed-size scalarized C version. No generic array-of-struct loop. No “for any N” shape. Just the exact five-body problem written in a form that mirrors what a compiler would like to see after aggressive specialization.

We compiled it twice: once with -ffast-math, once without. Here’s the full picture with the optimized variants included:

Implementation                  Relative
C (scalarized, no fast-math)    1.00
Koru (kernel fused)             1.00
C (scalarized, fast-math)       1.00
C                               1.12
Zig                             1.14
Rust                            1.15
SBCL (kernel DSL)               1.88
SBCL (Common Lisp)              2.03
GHC (Haskell)                   2.58

The hand-specialized C closed the gap completely — and barely edged ahead.

That does not weaken the result. It sharpens it.

Why This Is Stronger, Not Weaker

The point was never “C is slow.” C is obviously not slow. And we would be worried if we couldn’t get hand-optimized C to run faster — that would mean the benchmark was broken, not that Koru was magic.

The point is that Koru kernel code is carrying enough semantic information that the compiler can lower it into code that is within 1% of expert hand-specialized C, without forcing the programmer to write the hand-specialized C in the first place.

That is a much stronger statement than “Koru beats C on one benchmark.”

The plain C reference already had restrict in the hot path. That means the programmer was already supplying aliasing knowledge manually. In other words, the C version was already asking the human to act as part-time optimizer.

This was the original hot loop shape:

static inline void advance(struct body * restrict b, double dt) {
    for (int i = 0; i < 5; i++) {
        for (int j = i + 1; j < 5; j++) {
            double dx = b[i].x - b[j].x;
            double dy = b[i].y - b[j].y;
            double dz = b[i].z - b[j].z;
            double dsq = dx * dx + dy * dy + dz * dz;
            double mag = dt / (dsq * sqrt(dsq));

            b[i].vx -= dx * b[j].mass * mag;
            b[i].vy -= dy * b[j].mass * mag;
            b[i].vz -= dz * b[j].mass * mag;
            b[j].vx += dx * b[i].mass * mag;
            b[j].vy += dy * b[i].mass * mag;
            b[j].vz += dz * b[i].mass * mag;
        }
    }

    for (int i = 0; i < 5; i++) {
        b[i].x += dt * b[i].vx;
        b[i].y += dt * b[i].vy;
        b[i].z += dt * b[i].vz;
    }
}

That is already good C. It is not naive. It uses restrict. It uses fixed loop bounds. It uses the standard dsq * sqrt(dsq) trick. It keeps the data in a global fixed array.

And it was still slower than the fused Koru kernel.

So what did the C version need in order to catch up?

What We Had To Do In C

We had to stop writing “normal benchmark reference C” and start writing something much closer to a manual lowering of the optimized shape.

These were the key changes:

  • scalarize each body into separate locals instead of indexing through b[i]
  • make the problem fixed-size in the source, not just in the loop bounds
  • spell out all ten pair interactions explicitly
  • keep masses as separate constants
  • update positions in one fused straight-line block

This is the kind of helper the specialized version uses:

static inline void advance_pair(
    double xi, double yi, double zi,
    double xj, double yj, double zj,
    double mi, double mj,
    double *restrict vxi, double *restrict vyi, double *restrict vzi,
    double *restrict vxj, double *restrict vyj, double *restrict vzj
) {
    const double dx = xi - xj;
    const double dy = yi - yj;
    const double dz = zi - zj;
    const double dsq = dx * dx + dy * dy + dz * dz;
    const double mag = DT / (dsq * sqrt(dsq));

    *vxi -= dx * mj * mag;
    *vyi -= dy * mj * mag;
    *vzi -= dz * mj * mag;
    *vxj += dx * mi * mag;
    *vyj += dy * mi * mag;
    *vzj += dz * mi * mag;
}

And then the timestep becomes this:

advance_pair(x0, y0, z0, x1, y1, z1, m0, m1, &vx0, &vy0, &vz0, &vx1, &vy1, &vz1);
advance_pair(x0, y0, z0, x2, y2, z2, m0, m2, &vx0, &vy0, &vz0, &vx2, &vy2, &vz2);
advance_pair(x0, y0, z0, x3, y3, z3, m0, m3, &vx0, &vy0, &vz0, &vx3, &vy3, &vz3);
advance_pair(x0, y0, z0, x4, y4, z4, m0, m4, &vx0, &vy0, &vz0, &vx4, &vy4, &vz4);
advance_pair(x1, y1, z1, x2, y2, z2, m1, m2, &vx1, &vy1, &vz1, &vx2, &vy2, &vz2);
advance_pair(x1, y1, z1, x3, y3, z3, m1, m3, &vx1, &vy1, &vz1, &vx3, &vy3, &vz3);
advance_pair(x1, y1, z1, x4, y4, z4, m1, m4, &vx1, &vy1, &vz1, &vx4, &vy4, &vz4);
advance_pair(x2, y2, z2, x3, y3, z3, m2, m3, &vx2, &vy2, &vz2, &vx3, &vy3, &vz3);
advance_pair(x2, y2, z2, x4, y4, z4, m2, m4, &vx2, &vy2, &vz2, &vx4, &vy4, &vz4);
advance_pair(x3, y3, z3, x4, y4, z4, m3, m4, &vx3, &vy3, &vz3, &vx4, &vy4, &vz4);

x0 += DT * vx0; y0 += DT * vy0; z0 += DT * vz0;
x1 += DT * vx1; y1 += DT * vy1; z1 += DT * vz1;
x2 += DT * vx2; y2 += DT * vy2; z2 += DT * vz2;
x3 += DT * vx3; y3 += DT * vy3; z3 += DT * vz3;
x4 += DT * vx4; y4 += DT * vy4; z4 += DT * vz4;

That is the version that matched Koru.

To be clear: this is valid, good C. But it is no longer just “write the obvious program.” It is a human taking the semantic problem and manually reshaping it into a form the optimizer can exploit more aggressively.

That is the expertise gap Koru is trying to close.

Koru kernels make those facts part of the model:

  • the data shape is explicit
  • the relationships between elements are explicit
  • pairwise interaction is explicit
  • per-element update is explicit
  • aliasing constraints are implied by the kernel abstraction itself

The user writes this:

~std.kernel:init(Body) {
    { x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0, mass: SOLAR_MASS },
    // four more bodies...
}
| kernel k |>
    std.kernel:step(0..iterations)
    |> std.kernel:pairwise {
        const dx = k.x - k.other.x;
        const dy = k.y - k.other.y;
        const dz = k.z - k.other.z;
        const dsq = dx*dx + dy*dy + dz*dz;
        const mag = DT / (dsq * @sqrt(dsq));
        k.vx -= dx * k.other.mass * mag;
        k.vy -= dy * k.other.mass * mag;
        k.vz -= dz * k.other.mass * mag;
        k.other.vx += dx * k.mass * mag;
        k.other.vy += dy * k.mass * mag;
        k.other.vz += dz * k.mass * mag;
    }
    |> std.kernel:self {
        k.x += DT * k.vx;
        k.y += DT * k.vy;
        k.z += DT * k.vz;
    }

The matching C ends up looking like a manual lowering pass.

Ten explicit pair updates. Scalarized locals for every body. A shape that is closer to compiler IR than to the original problem statement.

That is exactly the point of Koru kernels.

The fast-math Surprise

An interesting detail: the scalarized C compiled without -ffast-math was consistently the fastest. Not by much, but consistently.

This matters because -ffast-math lets the C compiler reorder floating-point operations freely. For this particular workload, the IEEE-compliant evaluation order turned out to be better for the pipeline than whatever reordering -ffast-math chose.

Koru doesn’t use -ffast-math either. It gets its speed from structural knowledge — knowing that the computation is a fixed pairwise interaction on a known shape — not from loosening floating-point semantics.

The Full Landscape

The expanded benchmark tells a richer story than just “compiled languages are fast”:

Tier                     Languages               Range
Tied for first           C (scalarized), Koru    1.00x
Strong compiled          C, Zig, Rust            1.12–1.15x
GC’d, kernel-optimized   SBCL (kernel DSL)       1.88x
GC’d, native compiled    SBCL (Common Lisp)      2.03x
GC’d, lazy               GHC (Haskell)           2.58x

SBCL and GHC are both running the same algorithm with full type declarations and optimization flags. Those are respectable results for languages carrying garbage collectors and full runtime images. The fact that they land at 2–2.6x rather than 10–100x is a testament to their native code compilers.

The SBCL kernel DSL entry deserves explanation — see below.

But the headline is still the top tier: Koru’s high-level kernel abstraction is within 1% of hand-scalarized C.

The Real Conclusion

The honest conclusion is not:

“Koru is faster than C.”

The honest conclusion is:

Idiomatic Koru kernel code compiles to performance within 1% of hand-specialized C on this benchmark — and it beats the plain C, Zig, and Rust references by 12–17%.

If the only way for C to match Koru is to rewrite the program into a fixed-size scalarized form, then Koru is doing something valuable. It is preserving optimization-relevant truth in the source language instead of making the programmer smuggle that truth in manually through restrict, scalarization, fixed-size pointer types, and layout tricks.

Another way to say it:

  • the original C reference is what a good systems programmer would plausibly write
  • the fixed-size scalarized C is what a performance specialist writes after staring at the problem and the optimizer for a while
  • the Koru version is much closer to the first category in source style, but lands near the second category in output speed

What We Were Hoping For

We were hoping Koru kernels would be in the same neighborhood as idiomatic systems code.

What we got was this:

  • faster than the plain C, Zig, and Rust references
  • within 1% of a hand-specialized fixed-size C rewrite
  • without making the Koru programmer write the hand-specialized version
  • without -ffast-math or any other semantic-loosening flags

That is the result.

And that is why we think Koru kernels are genuinely promising.

Someone Ported the Kernel Abstraction to Lisp

After this post went up, Robert Smith (stylewarning) built a kernel DSL in Common Lisp that mirrors the Koru kernel vocabulary almost exactly:

Koru                           Lisp DSL
std.kernel:init(Body)          define-kernel-shape body 5
std.kernel:pairwise { ... }    define-pairwise-kernel
std.kernel:self { ... }        define-self-kernel
std.kernel:step(0..n)          define-kernel-step

The implementation is about 700 lines of Lisp macros that do compile-time unrolling, FMA rewriting, and — on x86-64 — hand-written SIMD vectorization using SBCL’s VOP system for packed-double vfmadd231pd instructions.

His results on an AMD Ryzen AI MAX+ PRO 395 (x86-64):

Implementation               Median (ms)
Lisp (kernel DSL)            1,651
C (gcc -O3 -ffast-math)      1,657
Lisp (conventional loops)    2,005

The kernel DSL matched C on his hardware. That is a real result.

On our benchmark machine (Apple Silicon, arm64), the same code lands at 1.88x behind Koru. The reason is straightforward: his SIMD backend is x86-64 only. On arm64 it falls back to scalar code with compile-time unrolling and FMA rewriting — still 8% faster than conventional SBCL, but without the vectorization that closes the gap on x86.

Koru doesn’t have this problem. The kernel compiler lowers to Zig, and LLVM auto-vectorizes for whatever platform you’re on — ARM NEON, x86 AVX, whatever. The kernel abstraction carries enough semantic information that the backend doesn’t need per-architecture SIMD code. That is the difference between building the optimization into a macro and building it into a compiler.

Both implementations create islands — in Koru, kernel:init owns its subtree so the compiler can reason about the whole computation. His macros do the same thing: define-kernel-step captures the kernel bodies and expands them inline. That’s the right design. You need a closed world to optimize aggressively.

The difference is in the backend. Koru’s kernel library is user-space code — it’s in the standard library, but it’s written with the same tools available to any library author. It lowers to Zig, and LLVM handles vectorization for the target platform. His DSL lowers to Lisp with hand-written SIMD intrinsics for one architecture.

Both approaches validate the same idea: the kernel vocabulary — pairwise, self, reduction, step — carries real optimization information. Whether that information is exploited by LLVM auto-vectorization or by hand-written SIMD macros, it works.

Kernels are islands by design — and islands are portable across languages. But Koru also has constructs like ~capture that compose with everything: ~for, ~if, nested captures, arbitrary pipeline stages — because Koru’s flow layer is pure dataflow. ~capture isn’t an island. It’s a transform that participates in the same continuation model as every other flow construct.

That kind of integration is harder to replicate in Lisp. You can bolt on islands — Robert Smith proved that convincingly. But writing a macro that composes transparently with every other control flow construct means fighting against a language where the default semantics are imperative. In Koru, the flow layer is declarative by default, so composition comes for free.

How This Happened

Full disclosure: this section exists because I was shitposting on X about Lisp being slow.

Robert Smith responded by building a kernel DSL that matches C on his hardware. I got completely schooled. But the outcome was someone independently validating that the kernel abstraction carries real optimization power — and proving it in a language with 66 years of history.

I have learned nothing. The shitposting produced the results. Why would I stop?