kernel:pairwise Beats Rust by 2%, Idiomatic Zig by 55%


Koru is very young. The compiler is experimental. The standard library is incomplete. But today we proved something important: the kernel abstraction delivers real performance.

The Benchmark

The n-body simulation is a classic computational benchmark. Five celestial bodies. Gravitational interactions. 50 million iterations. Every pair of bodies must interact every timestep.

This is exactly what kernel:pairwise is designed for.

The Results

Language               Time (mean)   vs Koru
Koru kernel:pairwise   1.330s        (baseline)
Rust                   1.353s        2% slower
Zig (hand-optimized)   1.359s        2% slower
Zig (idiomatic)        2.061s        55% slower

Koru beats Rust. Not by much - 2% - but it beats it. And it’s Rust.

Idiomatic Zig takes 55% longer than Koru on the same workload. This is the real story.
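The "vs Koru" figures are each mean time taken relative to Koru's 1.330s. Here is a minimal Rust sketch of that arithmetic, using only the means from the table above. The function name `percent_slower` is ours for illustration and is not part of the benchmark suite.

```rust
/// Percentage by which `time` exceeds `baseline`, rounded to the nearest integer.
fn percent_slower(time: f64, baseline: f64) -> i64 {
    ((time / baseline - 1.0) * 100.0).round() as i64
}

fn main() {
    let koru = 1.330; // mean seconds, from the results table
    for (name, time) in [
        ("Rust", 1.353),
        ("Zig (hand-optimized)", 1.359),
        ("Zig (idiomatic)", 2.061),
    ] {
        println!("{name}: {}% slower than Koru", percent_slower(time, koru));
    }
}
```

Note that the Rust figure is about 1.7% before rounding, so the gap to Rust is small in absolute terms (roughly 23 ms over the whole run).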

The Code

Idiomatic Zig (2.061s)

fn advance(bodies: []Body) void {
    for (bodies, 0..) |*bi, i| {
        for (bodies[i + 1..]) |*bj| {
            const dx = bi.x - bj.x;
            const dy = bi.y - bj.y;
            const dz = bi.z - bj.z;
            const dsq = dx*dx + dy*dy + dz*dz;
            const mag = DT / (dsq * @sqrt(dsq));

            bi.vx -= dx * bj.mass * mag;
            bi.vy -= dy * bj.mass * mag;
            bi.vz -= dz * bj.mass * mag;

            bj.vx += dx * bi.mass * mag;
            bj.vy += dy * bi.mass * mag;
            bj.vz += dz * bi.mass * mag;
        }
    }
}

Clean, readable, idiomatic. Uses slice iteration. 55% slower than Koru.

Hand-Optimized Zig (1.359s)

inline fn updatePair(b1: *Body, b2: *Body) void {
    const dx = b1.x - b2.x;
    const dy = b1.y - b2.y;
    const dz = b1.z - b2.z;
    const dsq = dx*dx + dy*dy + dz*dz;
    const mag = DT / (dsq * @sqrt(dsq));

    b1.vx -= dx * b2.mass * mag;
    b1.vy -= dy * b2.mass * mag;
    b1.vz -= dz * b2.mass * mag;

    b2.vx += dx * b1.mass * mag;
    b2.vy += dy * b1.mass * mag;
    b2.vz += dz * b1.mass * mag;
}

fn advance() void {
    // Separate pointers enable noalias optimization
    updatePair(&bodies[0], &bodies[1]);
    updatePair(&bodies[0], &bodies[2]);
    updatePair(&bodies[0], &bodies[3]);
    updatePair(&bodies[0], &bodies[4]);
    updatePair(&bodies[1], &bodies[2]);
    // ... all 10 pairs explicitly unrolled
}

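Passes the two bodies as separate pointer parameters so the optimizer can treat them as non-aliasing (the effect an explicit noalias qualifier would give). Compile-time known array size. Inlined helper function. Matches Rust. But you have to know these tricks.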

Koru kernel:pairwise (1.330s)

// Kernel shape declaration
~std.kernel:shape(Body) {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64,
}

// Main flow (abbreviated)
~parse_args()
| n iterations |>
    std.kernel:init(Body) {
        { x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0, mass: SOLAR_MASS },
        { x: 4.84143144246472090e+00, ... mass: 9.54791938424326609e-04 * SOLAR_MASS },
        { x: 8.34336671824457987e+00, ... mass: 2.85885980666130812e-04 * SOLAR_MASS },
        { x: 1.28943695621391310e+01, ... mass: 4.36624404335156298e-05 * SOLAR_MASS },
        { x: 1.53796971148509165e+01, ... mass: 5.15138902046611451e-05 * SOLAR_MASS },
    }
    | kernel k |>
        offset_momentum(bodies: k.ptr[0..k.len])
        |> energy(bodies: k.ptr[0..k.len])
            | e e1 |> print_energy(e: e1)
                |> for(0..iterations)
                    | each _ |>
                        std.kernel:pairwise {
                            const dx = k.x - k.other.x;
                            const dy = k.y - k.other.y;
                            const dz = k.z - k.other.z;
                            const dsq = dx*dx + dy*dy + dz*dz;
                            const mag = DT / (dsq * @sqrt(dsq));
                            k.vx -= dx * k.other.mass * mag;
                            k.vy -= dy * k.other.mass * mag;
                            k.vz -= dz * k.other.mass * mag;
                            k.other.vx += dx * k.mass * mag;
                            k.other.vy += dy * k.mass * mag;
                            k.other.vz += dz * k.mass * mag;
                        }
                        |> advance_positions(bodies: k.ptr[0..k.len])
                    | done |> energy(bodies: k.ptr[0..k.len])
                        | e e2 |> print_energy(e: e2)

The | kernel k |> binding introduces k into scope. Inside pairwise, k refers to the current body and k.other refers to the paired body. The compiler walks the continuation tree, finds the kernel binding, and rewrites the body accordingly.

2% faster than Rust. 55% faster than idiomatic Zig.

How It Works

The kernel:pairwise transform generates optimized code automatically:

1. Static Allocation

The kernel data becomes a fixed-size array at module scope:

var kernel_data = [_]Body{ ... 5 bodies ... };

No heap allocation. Compiler knows the exact size.

2. Compile-Time Length

The element count is extracted at compile time:

const kernel_len: usize = 5;

This enables loop unrolling and bounds check elimination.
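For readers more at home in Rust, steps 1 and 2 can be sketched roughly like this. It is an analogue for illustration only, not what Koru emits (the generated code is Zig). The names `KERNEL_DATA` and `KERNEL_LEN` simply mirror the ones above, and the bodies are zero-filled placeholders rather than the real initial conditions.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[derive(Clone, Copy)]
struct Body {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64,
}

// The element count is a compile-time constant...
const KERNEL_LEN: usize = 5;

// Placeholder body; the real benchmark uses the solar-system values.
const ZERO: Body = Body { x: 0.0, y: 0.0, z: 0.0, vx: 0.0, vy: 0.0, vz: 0.0, mass: 0.0 };

// ...and the data is a fixed-size array with static storage, so the length
// is part of the type and nothing is allocated on the heap.
static KERNEL_DATA: [Body; KERNEL_LEN] = [ZERO; KERNEL_LEN];

fn main() {
    println!(
        "{} bodies in {} bytes of static storage",
        KERNEL_DATA.len(),
        size_of::<[Body; KERNEL_LEN]>()
    );
}
```

Because the length lives in the array type, any loop written as `0..KERNEL_LEN` has bounds the optimizer can see without inspecting runtime state.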

3. Noalias Pointers

The pairwise operation generates an inline helper with separate pointer parameters:

const helper = struct {
    inline fn call(noalias self: *Body, noalias other: *Body) void {
        // ... body transformation ...
    }
};

The noalias qualifier tells LLVM that self and other never alias, enabling aggressive optimization.
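Rust gets a comparable guarantee from its borrow rules: two live `&mut` references can never point at the same value. The sketch below shows one way to obtain two such references into the same slice, using the standard `split_at_mut`. It is an illustration, not the code in the benchmark's Rust entry. `pair_mut` and `body_at` are hypothetical helpers, and `DT = 0.01` is an assumed value since the listings above use `DT` without stating it.

```rust
#[derive(Clone, Copy, Debug)]
struct Body {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64,
}

// Assumed timestep; the article uses DT without giving its value.
const DT: f64 = 0.01;

/// Same arithmetic as the pairwise body above, on two non-overlapping references.
#[inline(always)]
fn update_pair(a: &mut Body, b: &mut Body) {
    let dx = a.x - b.x;
    let dy = a.y - b.y;
    let dz = a.z - b.z;
    let dsq = dx * dx + dy * dy + dz * dz;
    let mag = DT / (dsq * dsq.sqrt());

    a.vx -= dx * b.mass * mag;
    a.vy -= dy * b.mass * mag;
    a.vz -= dz * b.mass * mag;

    b.vx += dx * a.mass * mag;
    b.vy += dy * a.mass * mag;
    b.vz += dz * a.mass * mag;
}

/// Hypothetical helper: borrow elements i and j (i < j) mutably at once.
fn pair_mut(bodies: &mut [Body], i: usize, j: usize) -> (&mut Body, &mut Body) {
    assert!(i < j && j < bodies.len());
    let (head, tail) = bodies.split_at_mut(j); // head = [0, j), tail = [j, len)
    (&mut head[i], &mut tail[0])
}

/// Hypothetical helper: a body at rest on the x axis.
fn body_at(x: f64, mass: f64) -> Body {
    Body { x, y: 0.0, z: 0.0, vx: 0.0, vy: 0.0, vz: 0.0, mass }
}

fn main() {
    let mut bodies = [body_at(0.0, 2.0), body_at(1.0, 1.0)];
    let (a, b) = pair_mut(&mut bodies, 0, 1);
    update_pair(a, b);
    // Equal and opposite impulses: total momentum along x stays at zero.
    let px = bodies[0].mass * bodies[0].vx + bodies[1].mass * bodies[1].vx;
    println!("{:?}\n{:?}\ntotal x-momentum: {px}", bodies[0], bodies[1]);
}
```

The point of the comparison: in Rust the non-aliasing guarantee comes from the type system, in hand-written Zig you have to arrange it yourself, and with kernel:pairwise the compiler generates it.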

4. Nested Loops with Constant Bounds

for (0..kernel_len) |i| {
    for (i + 1..kernel_len) |j| {
        helper.call(&kernel_ptr[i], &kernel_ptr[j]);
    }
}

With kernel_len as a compile-time constant, LLVM can unroll when profitable and eliminate bounds checks.
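The triangular loop shape is easy to check in isolation. The Rust sketch below uses a const generic `N` to stand in for the compile-time `kernel_len`, and records which index pairs get visited. `for_each_pair` and `collect_pairs` are illustrative names, not part of Koru or the benchmark.

```rust
/// Calls `f(i, j)` once for every pair of indices with i < j < N.
/// N is a compile-time constant, so both loop bounds are known statically.
fn for_each_pair<const N: usize, F: FnMut(usize, usize)>(mut f: F) {
    for i in 0..N {
        for j in (i + 1)..N {
            f(i, j);
        }
    }
}

/// Collects the visited pairs so the iteration order can be inspected.
fn collect_pairs<const N: usize>() -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for_each_pair::<N, _>(|i, j| pairs.push((i, j)));
    pairs
}

fn main() {
    let pairs = collect_pairs::<5>();
    // Five bodies give 5 * 4 / 2 = 10 interactions per timestep.
    println!("{} pairs: {:?}", pairs.len(), pairs);
}
```

For five bodies this visits 10 pairs, from (0, 1) through (3, 4), which is the same set the hand-optimized Zig version spells out call by call.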

The Insight

kernel:init produces fixed-size kernels in this version. This is the key insight.

When you write kernel:init(Body) { ... 5 bodies ... }, the compiler knows there are exactly 5 elements. It can:

  • Allocate statically instead of on the heap
  • Use compile-time loop bounds
  • Enable full loop unrolling
  • Apply noalias optimizations

The kernel abstraction doesn’t hide performance - it enables it by exposing compile-time information to the optimizer.

What This Means

You write declarative code (syntax still in flux):

std.kernel:init(Body) { ... }
| kernel k |>
    for(0..iterations)
    | each _ |>
        std.kernel:pairwise {
            // ... dx and mag computed as in the full listing above ...
            k.vx -= dx * k.other.mass * mag;
            k.other.vx += dx * k.mass * mag;
        }

You get hand-optimized performance automatically:

  • noalias pointer separation
  • static allocation
  • compile-time loop bounds
  • full inlining

The 55% speedup over idiomatic Zig isn’t magic. It’s the compiler doing what you’d have to do manually in other languages.

Honest Caveats

Koru is young. Very young. The kernel system is experimental. There are rough edges:

  • The syntax is still being refined and will likely change
  • The length extraction uses brace-counting (fragile for edge cases)
  • Cross-kernel pairwise isn’t implemented yet
  • The binding metadata could be cleaner

But the core concept is proven: kernel abstractions can deliver hand-optimized performance while keeping code readable and maintainable.
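To make the brace-counting caveat concrete, here is a deliberately naive counter written in Rust. This is not Koru's implementation, which is not shown in this post. It is a hypothetical stand-in (`count_elements` is our name) that shows the kind of input on which counting braces gives the wrong length.

```rust
/// Hypothetical naive length extraction: count every `{` that opens at
/// nesting depth 0, with no awareness of comments or string literals.
fn count_elements(block: &str) -> usize {
    let mut depth = 0usize;
    let mut count = 0;
    for ch in block.chars() {
        match ch {
            '{' => {
                if depth == 0 {
                    count += 1;
                }
                depth += 1;
            }
            '}' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    count
}

fn main() {
    let clean = "{ x: 0 },\n{ x: 1 },";
    let tricky = "{ x: 0 }, // {}\n{ x: 1 },";
    println!("clean:  {} elements", count_elements(clean));
    println!("tricky: {} elements", count_elements(tricky));
}
```

Both inputs describe two elements, but the braces inside the comment make the naive counter report three for the second one. A sturdier approach would take the length from a real parse of the init block instead of from raw characters.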

Why This Matters

Most languages force a choice: write beautiful code and accept the performance hit, or write ugly optimized code and accept the maintenance burden.

Koru’s kernel system shows a third way: abstractions that enable optimization.

By making data relationships explicit (pairwise, self, reduce), the compiler has more information than it would with raw loops. It can generate better code than you’d write by hand.

The abstraction isn’t a cost. It’s a benefit.

Run It Yourself

The benchmark is in the Koru repository:

  • Location: tests/regression/900_EXAMPLES_SHOWCASE/910_LANGUAGE_SHOOTOUT/
  • Includes: Rust, Zig (idiomatic), Zig (hand-optimized), Koru kernel:pairwise
  • Tool: Uses hyperfine for statistical benchmarking

cd tests/regression/900_EXAMPLES_SHOWCASE/910_LANGUAGE_SHOOTOUT
hyperfine --warmup 3 --runs 10 \
  "./2101_nbody/reference/nbody-rust 50000000" \
  "./2101_nbody/reference/nbody-zig 50000000" \
  "./nbody-noalias 50000000" \
  "./2101g_nbody_kernel_pairwise/a.out 50000000"

The numbers are real. The performance is real. The kernel abstraction works.