kernel:pairwise Matches Hand-Optimized Zig

The full reference code, for honesty

· 6 min read

kernel:pairwise Matches Hand-Optimized Zig

The n-body simulation is a standard benchmark for data-parallel computation. Five bodies. All unique pairs interact every timestep. 50 million iterations.

We ran it. Here are the numbers.

Results

10 runs, 50 million iterations, warmup 3.

MeanσMinMax
Zig (noalias *[5]Body)1.257s0.007s1.249s1.274s
C (clang -O3)1.259s0.017s1.243s1.293s
Koru (kernel:pairwise)1.265s0.006s1.260s1.280s
Koru (arrayed capture)1.368s0.005s1.363s1.379s

Three-way tie at the top. The 8ms gap between Koru and Zig is within C’s own run-to-run variance. Koru and Zig are also the most consistent — σ of 0.006–0.007s vs C’s 0.017s.

The arrayed capture variant is 8% behind. It accumulates velocity deltas into a [5][3]f64 array across nested loops rather than updating in-place — a different approach with a real cost.

The Reference Code

This is not a bragging post. The point of showing the reference implementations is that you can verify the comparison is fair.

C (clang -O3 -ffast-math -march=native)

From the Computer Language Benchmarks Game, contributed by Christoph Bauer. License: Revised BSD.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define pi 3.141592653589793
#define solar_mass (4 * pi * pi)
#define days_per_year 365.24

struct planet {
  double x, y, z;
  double vx, vy, vz;
  double mass;
};

static inline void advance(struct planet * restrict bodies, double dt)
{
  int i, j;

  for (i = 0; i < 5; i++) {
    for (j = i + 1; j < 5; j++) {
      double dx = bodies[i].x - bodies[j].x;
      double dy = bodies[i].y - bodies[j].y;
      double dz = bodies[i].z - bodies[j].z;
      double dsq = dx * dx + dy * dy + dz * dz;
      double mag = dt / (dsq * sqrt(dsq));
      bodies[i].vx -= dx * bodies[j].mass * mag;
      bodies[i].vy -= dy * bodies[j].mass * mag;
      bodies[i].vz -= dz * bodies[j].mass * mag;
      bodies[j].vx += dx * bodies[i].mass * mag;
      bodies[j].vy += dy * bodies[i].mass * mag;
      bodies[j].vz += dz * bodies[i].mass * mag;
    }
  }

  for (i = 0; i < 5; i++) {
    bodies[i].x += dt * bodies[i].vx;
    bodies[i].y += dt * bodies[i].vy;
    bodies[i].z += dt * bodies[i].vz;
  }
}

double energy(int nbodies, struct planet * bodies)
{
  double e;
  int i, j;

  e = 0.0;
  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    e += 0.5 * b->mass * (b->vx * b->vx + b->vy * b->vy + b->vz * b->vz);
    for (j = i + 1; j < nbodies; j++) {
      struct planet * b2 = &(bodies[j]);
      double dx = b->x - b2->x;
      double dy = b->y - b2->y;
      double dz = b->z - b2->z;
      double distance = sqrt(dx * dx + dy * dy + dz * dz);
      e -= (b->mass * b2->mass) / distance;
    }
  }
  return e;
}

void offset_momentum(int nbodies, struct planet * bodies)
{
  double px = 0.0, py = 0.0, pz = 0.0;
  int i;
  for (i = 0; i < nbodies; i++) {
    px += bodies[i].vx * bodies[i].mass;
    py += bodies[i].vy * bodies[i].mass;
    pz += bodies[i].vz * bodies[i].mass;
  }
  bodies[0].vx = - px / solar_mass;
  bodies[0].vy = - py / solar_mass;
  bodies[0].vz = - pz / solar_mass;
}

#define NBODIES 5
struct planet bodies[NBODIES] = {
  { 0, 0, 0, 0, 0, 0, solar_mass },
  {
    4.84143144246472090e+00,
    -1.16032004402742839e+00,
    -1.03622044471123109e-01,
    1.66007664274403694e-03 * days_per_year,
    7.69901118419740425e-03 * days_per_year,
    -6.90460016972063023e-05 * days_per_year,
    9.54791938424326609e-04 * solar_mass
  },
  {
    8.34336671824457987e+00,
    4.12479856412430479e+00,
    -4.03523417114321381e-01,
    -2.76742510726862411e-03 * days_per_year,
    4.99852801234917238e-03 * days_per_year,
    2.30417297573763929e-05 * days_per_year,
    2.85885980666130812e-04 * solar_mass
  },
  {
    1.28943695621391310e+01,
    -1.51111514016986312e+01,
    -2.23307578892655734e-01,
    2.96460137564761618e-03 * days_per_year,
    2.37847173959480950e-03 * days_per_year,
    -2.96589568540237556e-05 * days_per_year,
    4.36624404335156298e-05 * solar_mass
  },
  {
    1.53796971148509165e+01,
    -2.59193146099879641e+01,
    1.79258772950371181e-01,
    2.68067772490389322e-03 * days_per_year,
    1.62824170038242295e-03 * days_per_year,
    -9.51592254519715870e-05 * days_per_year,
    5.15138902046611451e-05 * solar_mass
  }
};

int main(int argc, char ** argv)
{
  int n = atoi(argv[1]);
  int i;

  offset_momentum(NBODIES, bodies);
  printf ("%.9f\n", energy(NBODIES, bodies));
  for (i = 1; i <= n; i++)
    advance(bodies, 0.01);
  printf ("%.9f\n", energy(NBODIES, bodies));
  return 0;
}

Zig (noalias *[5]Body, -O ReleaseFast)

Ported from the Benchmarks Game. Every optimization applied: @setFloatMode(.optimized) for fast-math, dsq * sqrt(dsq) instead of distance³ (saves two multiplications per pair), fixed *[5]Body so the compiler knows the iteration count, and noalias to match C’s restrict — telling the compiler the pointer doesn’t alias anything else in scope.

comptime {
    @setFloatMode(.optimized);
}

const std = @import("std");

const PI = 3.141592653589793;
const SOLAR_MASS = 4 * PI * PI;
const DAYS_PER_YEAR = 365.24;

const Body = struct {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64,
};

fn advance(noalias bodies: *[5]Body, dt: f64) void {
    for (0..5) |i| {
        for (i + 1..5) |j| {
            const dx = bodies[i].x - bodies[j].x;
            const dy = bodies[i].y - bodies[j].y;
            const dz = bodies[i].z - bodies[j].z;
            const dsq = dx * dx + dy * dy + dz * dz;
            const mag = dt / (dsq * @sqrt(dsq));

            bodies[i].vx -= dx * bodies[j].mass * mag;
            bodies[i].vy -= dy * bodies[j].mass * mag;
            bodies[i].vz -= dz * bodies[j].mass * mag;

            bodies[j].vx += dx * bodies[i].mass * mag;
            bodies[j].vy += dy * bodies[i].mass * mag;
            bodies[j].vz += dz * bodies[i].mass * mag;
        }
    }

    for (bodies) |*body| {
        body.x += dt * body.vx;
        body.y += dt * body.vy;
        body.z += dt * body.vz;
    }
}

fn energy(bodies: *const [5]Body) f64 {
    var e: f64 = 0.0;
    for (bodies, 0..) |body, i| {
        e += 0.5 * body.mass * (body.vx * body.vx + body.vy * body.vy + body.vz * body.vz);
        for (i + 1..5) |j| {
            const dx = body.x - bodies[j].x;
            const dy = body.y - bodies[j].y;
            const dz = body.z - bodies[j].z;
            e -= (body.mass * bodies[j].mass) / @sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return e;
}

fn offsetMomentum(bodies: *[5]Body) void {
    var px: f64 = 0.0;
    var py: f64 = 0.0;
    var pz: f64 = 0.0;
    for (bodies) |body| {
        px += body.vx * body.mass;
        py += body.vy * body.mass;
        pz += body.vz * body.mass;
    }
    bodies[0].vx = -px / SOLAR_MASS;
    bodies[0].vy = -py / SOLAR_MASS;
    bodies[0].vz = -pz / SOLAR_MASS;
}

pub fn main() !void {
    const args = try std.process.argsAlloc(std.heap.page_allocator);
    defer std.process.argsFree(std.heap.page_allocator, args);
    const n = try std.fmt.parseInt(u32, args[1], 10);

    var bodies = [_]Body{
        .{ .x = 0, .y = 0, .z = 0, .vx = 0, .vy = 0, .vz = 0, .mass = SOLAR_MASS },
        .{ .x = 4.84143144246472090e+00, .y = -1.16032004402742839e+00, .z = -1.03622044471123109e-01,
           .vx = 1.66007664274403694e-03 * DAYS_PER_YEAR, .vy = 7.69901118419740425e-03 * DAYS_PER_YEAR,
           .vz = -6.90460016972063023e-05 * DAYS_PER_YEAR, .mass = 9.54791938424326609e-04 * SOLAR_MASS },
        .{ .x = 8.34336671824457987e+00, .y = 4.12479856412430479e+00, .z = -4.03523417114321381e-01,
           .vx = -2.76742510726862411e-03 * DAYS_PER_YEAR, .vy = 4.99852801234917238e-03 * DAYS_PER_YEAR,
           .vz = 2.30417297573763929e-05 * DAYS_PER_YEAR, .mass = 2.85885980666130812e-04 * SOLAR_MASS },
        .{ .x = 1.28943695621391310e+01, .y = -1.51111514016986312e+01, .z = -2.23307578892655734e-01,
           .vx = 2.96460137564761618e-03 * DAYS_PER_YEAR, .vy = 2.37847173959480950e-03 * DAYS_PER_YEAR,
           .vz = -2.96589568540237556e-05 * DAYS_PER_YEAR, .mass = 4.36624404335156298e-05 * SOLAR_MASS },
        .{ .x = 1.53796971148509165e+01, .y = -2.59193146099879641e+01, .z = 1.79258772950371181e-01,
           .vx = 2.68067772490389322e-03 * DAYS_PER_YEAR, .vy = 1.62824170038242295e-03 * DAYS_PER_YEAR,
           .vz = -9.51592254519715870e-05 * DAYS_PER_YEAR, .mass = 5.15138902046611451e-05 * SOLAR_MASS },
    };

    offsetMomentum(&bodies);
    std.debug.print("{d:.9}\n", .{energy(&bodies)});
    for (0..n) |_| advance(&bodies, 0.01);
    std.debug.print("{d:.9}\n", .{energy(&bodies)});
}

Koru (kernel:pairwise, -O ReleaseFast)

const std = @import("std");
~import "$std/kernel"
~import "$std/control"

const PI: f64 = 3.141592653589793;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const DAYS_PER_YEAR: f64 = 365.24;
const DT: f64 = 0.01;

~std.kernel:shape(Body) {
    x: f64, y: f64, z: f64,
    vx: f64, vy: f64, vz: f64,
    mass: f64,
}

~event parse_args {}
| n u32

~proc parse_args {
    const args = std.process.argsAlloc(std.heap.page_allocator) catch unreachable;
    defer std.process.argsFree(std.heap.page_allocator, args);
    if (args.len < 2) { std.debug.print("Usage: nbody <iterations>\n", .{}); std.process.exit(1); }
    return .{ .n = std.fmt.parseInt(u32, args[1], 10) catch unreachable };
}

~event offset_momentum { bodies: []Body }

~proc offset_momentum {
    var px: f64 = 0; var py: f64 = 0; var pz: f64 = 0;
    for (bodies) |b| { px += b.vx * b.mass; py += b.vy * b.mass; pz += b.vz * b.mass; }
    bodies[0].vx = -px / SOLAR_MASS; bodies[0].vy = -py / SOLAR_MASS; bodies[0].vz = -pz / SOLAR_MASS;
}

~event energy { bodies: []const Body }
| e f64

~proc energy {
    var e: f64 = 0;
    for (bodies, 0..) |b, i| {
        e += 0.5 * b.mass * (b.vx * b.vx + b.vy * b.vy + b.vz * b.vz);
        for (i + 1..5) |j| {
            const dx = b.x - bodies[j].x; const dy = b.y - bodies[j].y; const dz = b.z - bodies[j].z;
            e -= (b.mass * bodies[j].mass) / @sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return .{ .e = e };
}

~event print_energy { e: f64 }
~proc print_energy { std.debug.print("{d:.9}\n", .{e}); }

~event advance_positions { bodies: []Body }
~proc advance_positions {
    for (bodies) |*b| { b.x += DT * b.vx; b.y += DT * b.vy; b.z += DT * b.vz; }
}

~parse_args()
| n iterations |>
    std.kernel:init(Body) {
        { x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0, mass: SOLAR_MASS },
        { x: 4.84143144246472090e+00, y: -1.16032004402742839e+00, z: -1.03622044471123109e-01,
          vx: 1.66007664274403694e-03 * DAYS_PER_YEAR, vy: 7.69901118419740425e-03 * DAYS_PER_YEAR,
          vz: -6.90460016972063023e-05 * DAYS_PER_YEAR, mass: 9.54791938424326609e-04 * SOLAR_MASS },
        { x: 8.34336671824457987e+00, y: 4.12479856412430479e+00, z: -4.03523417114321381e-01,
          vx: -2.76742510726862411e-03 * DAYS_PER_YEAR, vy: 4.99852801234917238e-03 * DAYS_PER_YEAR,
          vz: 2.30417297573763929e-05 * DAYS_PER_YEAR, mass: 2.85885980666130812e-04 * SOLAR_MASS },
        { x: 1.28943695621391310e+01, y: -1.51111514016986312e+01, z: -2.23307578892655734e-01,
          vx: 2.96460137564761618e-03 * DAYS_PER_YEAR, vy: 2.37847173959480950e-03 * DAYS_PER_YEAR,
          vz: -2.96589568540237556e-05 * DAYS_PER_YEAR, mass: 4.36624404335156298e-05 * SOLAR_MASS },
        { x: 1.53796971148509165e+01, y: -2.59193146099879641e+01, z: 1.79258772950371181e-01,
          vx: 2.68067772490389322e-03 * DAYS_PER_YEAR, vy: 1.62824170038242295e-03 * DAYS_PER_YEAR,
          vz: -9.51592254519715870e-05 * DAYS_PER_YEAR, mass: 5.15138902046611451e-05 * SOLAR_MASS },
    }
    | kernel k |>
        offset_momentum(bodies: k.ptr[0..k.len])
        |> energy(bodies: k.ptr[0..k.len])
            | e e1 |> print_energy(e: e1)
                |> for(0..iterations)
                    | each _ |>
                        std.kernel:pairwise {
                            const dx = k.x - k.other.x;
                            const dy = k.y - k.other.y;
                            const dz = k.z - k.other.z;
                            const dsq = dx*dx + dy*dy + dz*dz;
                            const mag = DT / (dsq * @sqrt(dsq));
                            k.vx -= dx * k.other.mass * mag;
                            k.vy -= dy * k.other.mass * mag;
                            k.vz -= dz * k.other.mass * mag;
                            k.other.vx += dx * k.mass * mag;
                            k.other.vy += dy * k.mass * mag;
                            k.other.vz += dz * k.mass * mag;
                        }
                        |> advance_positions(bodies: k.ptr[0..k.len])
                    | done |> energy(bodies: k.ptr[0..k.len])
                        | e e2 |> print_energy(e: e2)

Koru (arrayed capture, -O ReleaseFast)

The arrayed capture version predates kernel:pairwise. Instead of declaring a kernel shape, it uses Koru’s capture mechanism to accumulate velocity deltas into a [5][3]f64 array across nested loops, then applies them in a single apply event. 7% slower than the kernel version — the extra indirection through the delta array is the cost.

const std = @import("std");
~import "$std/control"

const PI: f64 = 3.141592653589793;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const DAYS_PER_YEAR: f64 = 365.24;
const DT: f64 = 0.01;

const Body = struct { x: f64, y: f64, z: f64, vx: f64, vy: f64, vz: f64, mass: f64 };

const BODIES = [_]Body{
    .{ .x = 0, .y = 0, .z = 0, .vx = 0, .vy = 0, .vz = 0, .mass = SOLAR_MASS },
    .{ .x = 4.84143144246472090e+00, .y = -1.16032004402742839e+00, .z = -1.03622044471123109e-01, .vx = 1.66007664274403694e-03 * DAYS_PER_YEAR, .vy = 7.69901118419740425e-03 * DAYS_PER_YEAR, .vz = -6.90460016972063023e-05 * DAYS_PER_YEAR, .mass = 9.54791938424326609e-04 * SOLAR_MASS },
    .{ .x = 8.34336671824457987e+00, .y = 4.12479856412430479e+00, .z = -4.03523417114321381e-01, .vx = -2.76742510726862411e-03 * DAYS_PER_YEAR, .vy = 4.99852801234917238e-03 * DAYS_PER_YEAR, .vz = 2.30417297573763929e-05 * DAYS_PER_YEAR, .mass = 2.85885980666130812e-04 * SOLAR_MASS },
    .{ .x = 1.28943695621391310e+01, .y = -1.51111514016986312e+01, .z = -2.23307578892655734e-01, .vx = 2.96460137564761618e-03 * DAYS_PER_YEAR, .vy = 2.37847173959480950e-03 * DAYS_PER_YEAR, .vz = -2.96589568540237556e-05 * DAYS_PER_YEAR, .mass = 4.36624404335156298e-05 * SOLAR_MASS },
    .{ .x = 1.53796971148509165e+01, .y = -2.59193146099879641e+01, .z = 1.79258772950371181e-01, .vx = 2.68067772490389322e-03 * DAYS_PER_YEAR, .vy = 1.62824170038242295e-03 * DAYS_PER_YEAR, .vz = -9.51592254519715870e-05 * DAYS_PER_YEAR, .mass = 5.15138902046611451e-05 * SOLAR_MASS },
};

const ZERO_DV = [5][3]f64{
    .{ 0, 0, 0 }, .{ 0, 0, 0 }, .{ 0, 0, 0 }, .{ 0, 0, 0 }, .{ 0, 0, 0 },
};

~event init {}
| bodies [5]Body
~proc init { return .{ .bodies = BODIES }; }

~event offset { bodies: []Body }
~proc offset {
    var px: f64 = 0; var py: f64 = 0; var pz: f64 = 0;
    for (bodies) |b| { px += b.vx * b.mass; py += b.vy * b.mass; pz += b.vz * b.mass; }
    bodies[0].vx = -px / SOLAR_MASS; bodies[0].vy = -py / SOLAR_MASS; bodies[0].vz = -pz / SOLAR_MASS;
}

~event energy { bodies: []const Body }
| e f64
~proc energy {
    var e: f64 = 0;
    for (bodies, 0..) |b, i| {
        e += 0.5 * b.mass * (b.vx*b.vx + b.vy*b.vy + b.vz*b.vz);
        var j = i + 1;
        while (j < 5) : (j += 1) {
            const dx = b.x - bodies[j].x; const dy = b.y - bodies[j].y; const dz = b.z - bodies[j].z;
            e -= (b.mass * bodies[j].mass) / @sqrt(dx*dx + dy*dy + dz*dz);
        }
    }
    return .{ .e = e };
}

~event print_e { e: f64 }
~proc print_e { std.debug.print("{d:.9}\n", .{e}); }

~event parse {}
| n u32
~proc parse {
    const args = std.process.argsAlloc(std.heap.page_allocator) catch unreachable;
    defer std.process.argsFree(std.heap.page_allocator, args);
    if (args.len < 2) { std.debug.print("Usage: nbody <n>\n", .{}); unreachable; }
    return .{ .n = std.fmt.parseInt(u32, args[1], 10) catch unreachable };
}

~event apply { bodies: []Body, dt: f64, dv: [5][3]f64 }
~proc apply {
    for (bodies, 0..) |*b, i| {
        b.vx += dv[i][0]; b.vy += dv[i][1]; b.vz += dv[i][2];
        b.x += dt * b.vx; b.y += dt * b.vy; b.z += dt * b.vz;
    }
}

~event pair_force { bi: Body, bj: Body, dt: f64 }
| result { fx: f64, fy: f64, fz: f64, mi: f64, mj: f64 }
~proc pair_force {
    const dx = bi.x - bj.x; const dy = bi.y - bj.y; const dz = bi.z - bj.z;
    const dsq = dx*dx + dy*dy + dz*dz;
    const mag = dt / (dsq * @sqrt(dsq));
    return .{ .result = .{ .fx = dx * mag, .fy = dy * mag, .fz = dz * mag, .mi = bi.mass, .mj = bj.mass } };
}

~parse()
| n iters |> init()
    | bodies b[mutable] |> offset(bodies: b[0..])
        |> energy(bodies: b[0..])
            | e e1 |> print_e(e: e1)
                |> for(0..iters)
                    | each |> capture({ dv: ZERO_DV })
                        | as acc |> for(0..5)
                            | each i |> for(i+1..5)
                                | each j |> pair_force(bi: b[i], bj: b[j], dt: DT)
                                    | result f |> captured { dv[i][0]: acc.dv[i][0] - f.fx*f.mj, dv[i][1]: acc.dv[i][1] - f.fy*f.mj, dv[i][2]: acc.dv[i][2] - f.fz*f.mj, dv[j][0]: acc.dv[j][0] + f.fx*f.mi, dv[j][1]: acc.dv[j][1] + f.fy*f.mi, dv[j][2]: acc.dv[j][2] + f.fz*f.mi }
                                | done |> _
                            | done |> _
                        | captured deltas |> apply(bodies: b[0..], dt: DT, dv: deltas.dv)
                    | done |> energy(bodies: b[0..])
                        | e e2 |> print_e(e: e2)

What the Compiler Generates

The kernel:pairwise transform emits this for the inner loop:

const __koru_pair = struct {
    inline fn call(
        noalias __koru_pair_self: *Body,
        noalias __koru_pair_other: *Body,
    ) void {
        // user-supplied pairwise body
    }
};
for (0..N) |i| {
    for (i + 1..N) |j| {
        __koru_pair.call(&ptr[i], &ptr[j]);
    }
}

noalias is emitted unconditionally, and it is always correct. The loop structure guarantees i < j, so &ptr[i] and &ptr[j] are always distinct elements. The transform knows this — it generated the loop.

In hand-written C or Zig, noalias/restrict is something you add after convincing yourself the pointers can never alias. You have to read the loop, trace the indices, and make the argument. That is compiler work. The proof obligation belongs on the side that has the information, and pairwise has it by construction.

The dsq * @sqrt(dsq) trick is in the user code directly — the pairwise body is opaque text as far as the compiler is concerned. The user still writes the physics. The compiler handles what it actually knows.

The benchmark is in the repository at tests/regression/900_EXAMPLES_SHOWCASE/910_LANGUAGE_SHOOTOUT/. The reference binaries are in _archive/2101_nbody_optimized/reference/.