Idiomatic Koru Kernels Match Hand-Specialized C



We wanted Koru kernels to land in the same ballpark as idiomatic C, Rust, and Zig.

The result was stronger than that.

Our fused n-body kernel, written in straightforward Koru kernel style, came in faster than the plain reference implementations. Every implementation here is “naive” — the obvious, idiomatic version a competent programmer would write in each language. No tricks, no hand-tuning, no -ffast-math:

Implementation          Relative
Koru (kernel fused)     1.00
C                       1.14
Zig                     1.14
Rust                    1.17
SBCL (kernel DSL)       1.88
SBCL (Common Lisp)      2.03
GHC (Haskell)           2.58

50 million iterations. Three warmup runs. Multiple timed runs via hyperfine. Same machine. Same benchmark harness. All source code is available on GitHub.

A note on numbers: these are ballpark figures. Exact ratios shift slightly between benchmark runs depending on system load, thermal state, and scheduling. The relative ordering is stable — we’ve run this many times — but don’t read too much into the second decimal place.

That was already interesting. But it raised the obvious question: is Koru really producing unusually strong code here, or is the reference C just not shaped the way the optimizer wants?

So we followed up the honest way.

The Follow-Up

We wrote a fixed-size scalarized C version. No generic array-of-struct loop. No “for any N” shape. Just the exact five-body problem written in a form that mirrors what a compiler would like to see after aggressive specialization.

We compiled it twice: once with -ffast-math, once without. Here’s the full picture with the optimized variants included:

Implementation                  Relative
C (scalarized, no fast-math)    1.00
Koru (kernel fused)             1.00
C (scalarized, fast-math)       1.00
C                               1.12
Zig                             1.14
Rust                            1.15
SBCL (kernel DSL)               1.88
SBCL (Common Lisp)              2.03
GHC (Haskell)                   2.58

The hand-specialized C closed the gap completely — and barely edged ahead.

That does not weaken the result. It sharpens it.

Why This Is Stronger, Not Weaker

The point was never “C is slow.” C is obviously not slow. And we would be worried if we couldn’t get hand-optimized C to run faster — that would mean the benchmark was broken, not that Koru was magic.

The point is that Koru kernel code is carrying enough semantic information that the compiler can lower it into code that is within 1% of expert hand-specialized C, without forcing the programmer to write the hand-specialized C in the first place.

That is a much stronger statement than “Koru beats C on one benchmark.”

The plain C reference already had restrict in the hot path. That means the programmer was already supplying aliasing knowledge manually. In other words, the C version was already asking the human to act as part-time optimizer.

This was the original hot loop shape:

static inline void advance(struct body * restrict b, double dt) {
    for (int i = 0; i < 5; i++) {
        for (int j = i + 1; j < 5; j++) {
            double dx = b[i].x - b[j].x;
            double dy = b[i].y - b[j].y;
            double dz = b[i].z - b[j].z;
            double dsq = dx * dx + dy * dy + dz * dz;
            double mag = dt / (dsq * sqrt(dsq));

            b[i].vx -= dx * b[j].mass * mag;
            b[i].vy -= dy * b[j].mass * mag;
            b[i].vz -= dz * b[j].mass * mag;
            b[j].vx += dx * b[i].mass * mag;
            b[j].vy += dy * b[i].mass * mag;
            b[j].vz += dz * b[i].mass * mag;
        }
    }

    for (int i = 0; i < 5; i++) {
        b[i].x += dt * b[i].vx;
        b[i].y += dt * b[i].vy;
        b[i].z += dt * b[i].vz;
    }
}

That is already good C. It is not naive. It uses restrict. It uses fixed loop bounds. It uses the standard dsq * sqrt(dsq) trick. It keeps the data in a global fixed array.

And it was still slower than the fused Koru kernel.

So what did the C version need in order to catch up?

What We Had To Do In C

We had to stop writing “normal benchmark reference C” and start writing something much closer to a manual lowering of the optimized shape.

These were the key changes:

  • scalarize each body into separate locals instead of indexing through b[i]
  • make the problem fixed-size in the source, not just in the loop bounds
  • spell out all ten pair interactions explicitly
  • keep masses as separate constants
  • update positions in one fused straight-line block

This is the kind of helper the specialized version uses:

static inline void advance_pair(
    double xi, double yi, double zi,
    double xj, double yj, double zj,
    double mi, double mj,
    double *restrict vxi, double *restrict vyi, double *restrict vzi,
    double *restrict vxj, double *restrict vyj, double *restrict vzj
) {
    const double dx = xi - xj;
    const double dy = yi - yj;
    const double dz = zi - zj;
    const double dsq = dx * dx + dy * dy + dz * dz;
    const double mag = DT / (dsq * sqrt(dsq));

    *vxi -= dx * mj * mag;
    *vyi -= dy * mj * mag;
    *vzi -= dz * mj * mag;
    *vxj += dx * mi * mag;
    *vyj += dy * mi * mag;
    *vzj += dz * mi * mag;
}

And then the timestep becomes this:

advance_pair(x0, y0, z0, x1, y1, z1, m0, m1, &vx0, &vy0, &vz0, &vx1, &vy1, &vz1);
advance_pair(x0, y0, z0, x2, y2, z2, m0, m2, &vx0, &vy0, &vz0, &vx2, &vy2, &vz2);
advance_pair(x0, y0, z0, x3, y3, z3, m0, m3, &vx0, &vy0, &vz0, &vx3, &vy3, &vz3);
advance_pair(x0, y0, z0, x4, y4, z4, m0, m4, &vx0, &vy0, &vz0, &vx4, &vy4, &vz4);
advance_pair(x1, y1, z1, x2, y2, z2, m1, m2, &vx1, &vy1, &vz1, &vx2, &vy2, &vz2);
advance_pair(x1, y1, z1, x3, y3, z3, m1, m3, &vx1, &vy1, &vz1, &vx3, &vy3, &vz3);
advance_pair(x1, y1, z1, x4, y4, z4, m1, m4, &vx1, &vy1, &vz1, &vx4, &vy4, &vz4);
advance_pair(x2, y2, z2, x3, y3, z3, m2, m3, &vx2, &vy2, &vz2, &vx3, &vy3, &vz3);
advance_pair(x2, y2, z2, x4, y4, z4, m2, m4, &vx2, &vy2, &vz2, &vx4, &vy4, &vz4);
advance_pair(x3, y3, z3, x4, y4, z4, m3, m4, &vx3, &vy3, &vz3, &vx4, &vy4, &vz4);

x0 += DT * vx0; y0 += DT * vy0; z0 += DT * vz0;
x1 += DT * vx1; y1 += DT * vy1; z1 += DT * vz1;
x2 += DT * vx2; y2 += DT * vy2; z2 += DT * vz2;
x3 += DT * vx3; y3 += DT * vy3; z3 += DT * vz3;
x4 += DT * vx4; y4 += DT * vy4; z4 += DT * vz4;

That is the version that matched Koru.

To be clear: this is valid, good C. But it is no longer just “write the obvious program.” It is a human taking the semantic problem and manually reshaping it into a form the optimizer can exploit more aggressively.

That is the expertise gap Koru is trying to close.

Koru kernels make those facts part of the model:

  • the data shape is explicit
  • the relationships between elements are explicit
  • pairwise interaction is explicit
  • per-element update is explicit
  • aliasing constraints are implied by the kernel abstraction itself

The user writes this:

~std.kernel:init(Body) {
    { x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0, mass: SOLAR_MASS },
    // four more bodies...
}
| kernel k |>
    std.kernel:step(0..iterations)
    |> std.kernel:pairwise {
        const dx = k.x - k.other.x;
        const dy = k.y - k.other.y;
        const dz = k.z - k.other.z;
        const dsq = dx*dx + dy*dy + dz*dz;
        const mag = DT / (dsq * @sqrt(dsq));
        k.vx -= dx * k.other.mass * mag;
        k.vy -= dy * k.other.mass * mag;
        k.vz -= dz * k.other.mass * mag;
        k.other.vx += dx * k.mass * mag;
        k.other.vy += dy * k.mass * mag;
        k.other.vz += dz * k.mass * mag;
    }
    |> std.kernel:self {
        k.x += DT * k.vx;
        k.y += DT * k.vy;
        k.z += DT * k.vz;
    }

The matching C ends up looking like a manual lowering pass.

Ten explicit pair updates. Scalarized locals for every body. A shape that is closer to compiler IR than to the original problem statement.

That is exactly the point of Koru kernels.

The fast-math Surprise

An interesting detail: the scalarized C compiled without -ffast-math was consistently the fastest. Not by much, but consistently.

This matters because -ffast-math lets the C compiler reorder floating-point operations freely. For this particular workload, the IEEE-compliant evaluation order turned out to be better for the pipeline than whatever reordering -ffast-math chose.

Koru doesn’t use -ffast-math either. It gets its speed from structural knowledge — knowing that the computation is a fixed pairwise interaction on a known shape — not from loosening floating-point semantics.

The Full Landscape

The expanded benchmark tells a richer story than just “compiled languages are fast”:

Tier                     Languages               Range
Tied for first           C (scalarized), Koru    1.00x
Strong compiled          C, Zig, Rust            1.12–1.15x
GC’d, kernel-optimized   SBCL (kernel DSL)       1.88x
GC’d, native compiled    SBCL (Common Lisp)      2.03x
GC’d, lazy               GHC (Haskell)           2.58x

SBCL and GHC are both running the same algorithm with full type declarations and optimization flags. Those are respectable results for languages carrying garbage collectors and full runtime images. The fact that they land at 2–2.6x rather than 10–100x is a testament to their native code compilers.

The SBCL kernel DSL entry deserves explanation — see below.

But the headline is still the top tier: Koru’s high-level kernel abstraction is within 1% of hand-scalarized C.

The Real Conclusion

The honest conclusion is not:

“Koru is faster than C.”

The honest conclusion is:

Idiomatic Koru kernel code compiles to performance within 1% of hand-specialized C on this benchmark — and it beats the plain C, Zig, and Rust references by 12–17%.

If the only way for C to match Koru is to rewrite the program into a fixed-size scalarized form, then Koru is doing something valuable. It is preserving optimization-relevant truth in the source language instead of making the programmer smuggle that truth in manually through restrict, scalarization, fixed-size pointer types, and layout tricks.

Another way to say it:

  • the original C reference is what a good systems programmer would plausibly write
  • the fixed-size scalarized C is what a performance specialist writes after staring at the problem and the optimizer for a while
  • the Koru version is much closer to the first category in source style, but lands near the second category in output speed

What We Were Hoping For

We were hoping Koru kernels would be in the same neighborhood as idiomatic systems code.

What we got was this:

  • faster than the plain C, Zig, and Rust references
  • within 1% of a hand-specialized fixed-size C rewrite
  • without making the Koru programmer write the hand-specialized version
  • without -ffast-math or any other semantic-loosening flags

That is the result.

And that is why we think Koru kernels are genuinely promising.

Someone Ported the Kernel Abstraction to Lisp

After this post went up, Robert Smith (stylewarning) built a kernel DSL in Common Lisp that mirrors the Koru kernel vocabulary almost exactly:

Koru                           Lisp DSL
std.kernel:init(Body)          define-kernel-shape body 5
std.kernel:pairwise { ... }    define-pairwise-kernel
std.kernel:self { ... }        define-self-kernel
std.kernel:step(0..n)          define-kernel-step

The implementation is about 700 lines of Lisp macros that do compile-time unrolling, FMA rewriting, and — on x86-64 — hand-written SIMD vectorization using SBCL’s VOP system for packed-double vfmadd231pd instructions.

His results on an AMD Ryzen AI MAX+ PRO 395 (x86-64):

Implementation               Median (ms)
Lisp (kernel DSL)            1,651
C (gcc -O3 -ffast-math)      1,657
Lisp (conventional loops)    2,005

The kernel DSL matched C on his hardware. That is a real result.

On our benchmark machine (Apple Silicon, arm64), the same code lands at 1.88x behind Koru. The reason is straightforward: his SIMD backend is x86-64 only. On arm64 it falls back to scalar code with compile-time unrolling and FMA rewriting — still 8% faster than conventional SBCL, but without the vectorization that closes the gap on x86.

Koru doesn’t have this problem. The kernel compiler lowers to Zig, and LLVM auto-vectorizes for whatever platform you’re on — ARM NEON, x86 AVX, whatever. The kernel abstraction carries enough semantic information that the backend doesn’t need per-architecture SIMD code. That is the difference between building the optimization into a macro and building it into a compiler.

Both implementations create islands — in Koru, kernel:init owns its subtree so the compiler can reason about the whole computation. His macros do the same thing: define-kernel-step captures the kernel bodies and expands them inline. That’s the right design. You need a closed world to optimize aggressively.

The difference is in the backend. Koru’s kernel library is user-space code — it’s in the standard library, but it’s written with the same tools available to any library author. It lowers to Zig, and LLVM handles vectorization for the target platform. His DSL lowers to Lisp with hand-written SIMD intrinsics for one architecture.

Both approaches validate the same idea: the kernel vocabulary — pairwise, self, reduction, step — carries real optimization information. Whether that information is exploited by LLVM auto-vectorization or by hand-written SIMD macros, it works.

Kernels are islands by design — and islands are portable across languages. But Koru also has constructs like ~capture that compose with everything: ~for, ~if, nested captures, arbitrary pipeline stages — because Koru’s flow layer is pure dataflow. ~capture isn’t an island. It’s a transform that participates in the same continuation model as every other flow construct.

That kind of integration is harder to replicate in Lisp. You can bolt on islands — Robert Smith proved that convincingly. But writing a macro that composes transparently with every other control flow construct means fighting against a language where the default semantics are imperative. In Koru, the flow layer is declarative by default, so composition comes for free.

How This Happened

Full disclosure: this section exists because I was shitposting on X about Lisp being slow.

Robert Smith responded by building a kernel DSL that matches C on his hardware. I got completely schooled. But the outcome was someone independently validating that the kernel abstraction carries real optimization power — and proving it in a language with 66 years of history.

I have learned nothing. The shitposting produced the results. Why would I stop?