N-Body in Pure Koru: A Complete Walkthrough
The N-body problem simulates gravitational interactions between celestial bodies. It’s a classic benchmark because it tests everything: tight loops, floating-point math, memory access patterns, and algorithmic structure.
Let’s implement it in Koru using nothing but event flows, captures, and subflows. We’ll walk through every piece, understand what it does, and see why the code is both beautiful and fast.
The Setup
const std = @import("std");
~import "$std/control"
~import "$std/io"
const PI: f64 = 3.141592653589793;
const SOLAR_MASS: f64 = 4.0 * PI * PI;
const DAYS_PER_YEAR: f64 = 365.24;
const DT: f64 = 0.01;
const Body = struct { x: f64, y: f64, z: f64, vx: f64, vy: f64, vz: f64, mass: f64 };

Standard Zig imports and constants. Body is a simple struct with position, velocity, and mass. Nothing surprising here—this is the same data layout you’d use in any language.
The ~import statements bring in Koru’s standard control flow (for, capture) and I/O (print) events.
Event Declarations: Contracts with Branches
Every computation in Koru starts with an event declaration—a contract that specifies inputs and possible outputs:
~event init {}
| bodies [5]Body

This says: init takes no inputs and produces one branch called bodies containing a fixed array of 5 Body structs.
~event energy { bodies: []const Body }
| e f64

energy takes a const slice of bodies and produces a single f64 value on branch e.
~event pair_delta { bi: Body, bj: Body, dt: f64 }
| delta { dx: f64, dy: f64, dz: f64, dsq: f64, dt: f64, mi: f64, mj: f64 }

pair_delta takes two bodies and a timestep, produces a delta branch with computed differences and derived values.
Why this matters: Event declarations aren’t function signatures—they’re continuation declarations. An event says “this thing can happen, and when it does, here are the possible continuations.” The inputs and outputs are explicit. You can reason about each event in isolation.
In Koru, events are pure by default. Purity flows transitively through the program—if an event only calls pure events, it’s pure. If it calls a ~proc (which can have side effects), that impurity propagates. The compiler tracks this, marking AST nodes with their purity status.
Subflows: Pure Transformations
The simplest way to implement an event is a subflow—a pure transformation from inputs to outputs:
~pair_delta = const({ dx: bi.x - bj.x, dy: bi.y - bj.y, dz: bi.z - bj.z })
| as d |> delta {
d.dx,
d.dy,
d.dz,
dsq: d.dx*d.dx + d.dy*d.dy + d.dz*d.dz,
dt,
mi: bi.mass,
mj: bj.mass
}

Let’s break this down:

- const({ dx: ..., dy: ..., dz: ... }) - Compute intermediate values once
- | as d |> - Bind those values to d for the next step
- delta { ... } - Construct the output branch
The shorthand syntax is doing work here:
- d.dx, expands to dx: d.dx
- dt, expands to dt: dt
- mi: bi.mass is explicit because the field name differs from the expression
This is pure: Given the same bi, bj, and dt, you always get the same delta. No mutation, no side effects, no hidden dependencies.
~pair_force = const({ mag: dt / (dsq * @sqrt(dsq)) })
| as c |> result {
fx: dx * c.mag,
fy: dy * c.mag,
fz: dz * c.mag,
mi,
mj
}

Same pattern: compute mag once (the inverse-cube factor), then construct the force result. Pure transformation.
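Where does that inverse-cube factor come from? This derivation isn’t in the code itself, but it’s standard: with G = 1 in these units (masses scaled by SOLAR_MASS), the velocity kick body i receives from body j over one timestep is the acceleration times dt:

```latex
\Delta \vec{v}_i \;=\; \vec{a}_i \, dt
  \;=\; -\,\frac{m_j \,(\vec{r}_i - \vec{r}_j)}{\lvert \vec{r}_i - \vec{r}_j \rvert^{3}} \, dt
```

With dsq = dx² + dy² + dz² = |r_i − r_j|², the shared scalar is mag = dt / (dsq · √dsq) = dt / |r|³. Each component of the kick is then dx · mag scaled by the other body’s mass, which is exactly the f.fx*mj and f.fx*mi terms accumulated in the main flow later.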
Chained Captures: Functional Accumulation
The energy calculation needs to sum over all bodies (kinetic) and all pairs (potential). This is a fold—and in Koru, we express it with chained captures:
~energy = capture({ e: @as(f64, 0) })
| as ke |> for(bodies)
| each b |> captured { e: ke.e + 0.5 * b.mass * (b.vx*b.vx + b.vy*b.vy + b.vz*b.vz) }
| captured kinetic |> capture({ e: kinetic.e })
| as pe |> for(0..5)
| each i |> for(i+1..5)
| each j |> captured {
e: pe.e - (bodies[i].mass * bodies[j].mass) / @sqrt(
(bodies[i].x - bodies[j].x)*(bodies[i].x - bodies[j].x) +
(bodies[i].y - bodies[j].y)*(bodies[i].y - bodies[j].y) +
(bodies[i].z - bodies[j].z)*(bodies[i].z - bodies[j].z)
)
}
| captured result |> e { result.e }

This looks dense. Let’s trace the flow:
Phase 1: Kinetic Energy
capture({ e: @as(f64, 0) }) // Start with e = 0
| as ke |> for(bodies) // For each body...
| each b |> captured { // ...accumulate:
e: ke.e + 0.5 * b.mass * (b.vx*b.vx + b.vy*b.vy + b.vz*b.vz)
}
| captured kinetic |>        // When done, bind result to 'kinetic'

The capture creates an accumulator. The for loop iterates bodies. Each | each b |> receives one body, and captured { e: ... } returns the updated accumulator.
When the loop completes, | captured kinetic |> receives the final accumulated value.
Phase 2: Potential Energy
capture({ e: kinetic.e }) // Start with kinetic energy
| as pe |> for(0..5) // For i in 0..5
| each i |> for(i+1..5) // For j in i+1..5
| each j |> captured { // Subtract potential:
e: pe.e - (bodies[i].mass * bodies[j].mass) / @sqrt(...)
}
| captured result |> e { result.e }   // Return final energy

The second capture chains from the first—it starts with kinetic.e. The nested loops iterate all unique pairs (i, j) where j > i. Each pair subtracts gravitational potential energy.
Why chaining works: Each capture is independent. The first computes kinetic energy. The second takes that result and computes total energy. No mutable variables. No loop indices leaking. Just data flowing through transformations.
On purity: captured { e: pe.e - ... } looks like mutation, but it isn’t. We’re not mutating pe—we’re returning a new value that becomes the accumulator for the next iteration. This is exactly how Haskell’s foldl works:
foldl (\acc x -> acc + x) 0 xs

Each iteration receives the previous accumulator and returns a new one. No mutation. The compiler might implement it as an in-place update for efficiency (as GHC does with strict folds), but semantically it’s pure. Given the same inputs, captured {...} always produces the same output.
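If it helps to see the same shape in a more familiar setting, here is a sketch of the two chained folds in Python. The two-body data is made up for illustration; it is not the article’s solar-system initial conditions.

```python
from functools import reduce
from itertools import combinations

# Hypothetical two-body data (each tuple: x, y, z, vx, vy, vz, mass).
bodies = [
    (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 39.478),
    (4.84, -1.16, -0.10, 0.60, 2.81, -0.02, 0.000954),
]

# Phase 1: kinetic energy as a left fold. Each step RETURNS a new
# accumulator instead of mutating the old one.
kinetic = reduce(
    lambda e, b: e + 0.5 * b[6] * (b[3]**2 + b[4]**2 + b[5]**2),
    bodies,
    0.0,
)

# Phase 2: chain from the kinetic result, folding over unique pairs
# and subtracting each pair's gravitational potential energy.
def minus_potential(e, pair):
    bi, bj = pair
    dist = ((bi[0] - bj[0])**2 + (bi[1] - bj[1])**2 + (bi[2] - bj[2])**2) ** 0.5
    return e - bi[6] * bj[6] / dist

energy = reduce(minus_potential, combinations(bodies, 2), kinetic)
print(energy)
```

The chaining is the same as in the Koru code: the second fold’s initial accumulator is the first fold’s result.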
The Return
| captured result |> e { result.e }

The final e { result.e } constructs the output branch. This is a scalar return—the event declared | e f64, and we’re returning just that float.
The Main Flow: Orchestration
Now we wire everything together:
~parse()
| n iters |> init()
| bodies b[mutable] |> offset(bodies: b[0..])
|> energy(bodies: b[0..])
| e e1 |> std.io:print.ln("{{ e1:d:.9 }}")
|> for(0..iters)
| each |> capture({ dv: ZERO_DV })
| as acc |> for(0..5)
| each i |> for(i+1..5)
| each j |> pair_delta(bi: b[i], bj: b[j], dt: DT)
| delta d |> pair_force(d.dx, d.dy, d.dz, d.dsq, d.dt, d.mi, d.mj)
| result f |> captured {
dv[i][0]: acc.dv[i][0] - f.fx*f.mj,
dv[i][1]: acc.dv[i][1] - f.fy*f.mj,
dv[i][2]: acc.dv[i][2] - f.fz*f.mj,
dv[j][0]: acc.dv[j][0] + f.fx*f.mi,
dv[j][1]: acc.dv[j][1] + f.fy*f.mi,
dv[j][2]: acc.dv[j][2] + f.fz*f.mi
}
| captured deltas |> apply(bodies: b[0..], dt: DT, deltas.dv)
| captured |> energy(bodies: b[0..])
| e e2 |> std.io:print.ln("{{ e2:d:.9 }}")

Let’s read it top to bottom:
- ~parse() - Parse command-line args
- | n iters |> - Bind iteration count
- init() - Initialize solar system
- | bodies b[mutable] |> - Bind bodies as mutable (we’ll update velocities)
- offset(bodies: b[0..]) - Offset sun’s velocity so momentum is zero
- energy(bodies: b[0..]) - Compute initial energy
- | e e1 |> print.ln(...) - Print it
- for(0..iters) - Main simulation loop
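One item above deserves a note: the offset event itself isn’t shown in this walkthrough. Its stated job is to zero total momentum by adjusting the sun’s velocity; a Python sketch of the idea, under that assumption (all names and data here are illustrative, not the actual Koru implementation):

```python
import math

SOLAR_MASS = 4.0 * math.pi * math.pi  # same mass convention as the article

def offset(bodies):
    # Give the sun (bodies[0], which starts at rest) the velocity that
    # cancels the planets' total momentum.
    px = sum(b["m"] * b["vx"] for b in bodies)
    py = sum(b["m"] * b["vy"] for b in bodies)
    pz = sum(b["m"] * b["vz"] for b in bodies)
    bodies[0]["vx"] = -px / SOLAR_MASS
    bodies[0]["vy"] = -py / SOLAR_MASS
    bodies[0]["vz"] = -pz / SOLAR_MASS

# Tiny check with made-up data: total momentum is (numerically) zero after.
bodies = [
    {"m": SOLAR_MASS, "vx": 0.0, "vy": 0.0, "vz": 0.0},
    {"m": 0.001, "vx": 1.0, "vy": -2.0, "vz": 0.5},
]
offset(bodies)
print(sum(b["m"] * b["vx"] for b in bodies))
```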
Inside the loop:
- capture({ dv: ZERO_DV }) - Fresh velocity-delta accumulator each iteration
- Nested loops over pairs - Compute forces
- pair_delta → pair_force - Pipeline: positions → deltas → forces
- captured { dv[i][0]: ... } - Accumulate velocity changes
- apply(bodies: b[0..], dt: DT, deltas.dv) - Apply accumulated deltas
- | captured |> energy(...) - After the loop completes, compute final energy
Reasoning About State
The [mutable] annotation on b is the only place where mutation happens. Everything else is pure transformation:
- pair_delta: pure (inputs → delta)
- pair_force: pure (delta → forces)
- energy: pure (bodies → float)
- capture loops: pure (accumulator updates are functional)
The mutation is isolated to apply, which explicitly takes bodies: []Body as a mutable slice. You can see exactly where state changes.
The Velocity Delta Accumulation
This is the trickiest part—let’s zoom in:
| result f |> captured {
dv[i][0]: acc.dv[i][0] - f.fx*f.mj,
dv[i][1]: acc.dv[i][1] - f.fy*f.mj,
dv[i][2]: acc.dv[i][2] - f.fz*f.mj,
dv[j][0]: acc.dv[j][0] + f.fx*f.mi,
dv[j][1]: acc.dv[j][1] + f.fy*f.mi,
dv[j][2]: acc.dv[j][2] + f.fz*f.mi
}

For each pair (i, j):

- Body i loses velocity proportional to force × mass of j
- Body j gains velocity proportional to force × mass of i
Newton’s third law, expressed as accumulator updates. The acc.dv[i][0] reads the current accumulated value; the captured { ... } returns the new accumulator with updated entries.
No indices are reused incorrectly. No off-by-one errors hiding in loop bounds. The structure of the code mirrors the structure of the physics.
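A quick way to convince yourself the signs are right: the pair update conserves momentum exactly. A small Python sketch mirroring the accumulator update above (force components and masses are made-up values):

```python
def accumulate(dv, i, j, fx, fy, fz, mi, mj):
    # Return a NEW accumulator (a tuple of per-body (dvx, dvy, dvz)
    # triples) with the two symmetric entries replaced -- mirroring
    # the captured { dv[i][0]: ... } update.
    new = list(dv)
    new[i] = (dv[i][0] - fx * mj, dv[i][1] - fy * mj, dv[i][2] - fz * mj)
    new[j] = (dv[j][0] + fx * mi, dv[j][1] + fy * mi, dv[j][2] + fz * mi)
    return tuple(new)

dv0 = ((0.0, 0.0, 0.0),) * 3
dv1 = accumulate(dv0, 0, 1, fx=0.25, fy=0.0, fz=0.0, mi=2.0, mj=4.0)

# Momentum check: mi * dv[i] + mj * dv[j] cancels for each pair.
print(2.0 * dv1[0][0] + 4.0 * dv1[1][0])
```

Body 0 picks up −fx·mj = −1.0 and body 1 picks up +fx·mi = +0.5, so the mass-weighted sum is zero: Newton’s third law falls out of the two signed updates.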
Procs: When You Need Imperative Code
Some events use ~proc instead of subflows—for I/O, argument parsing, or when imperative style is clearer:
~proc parse {
const args = std.process.argsAlloc(std.heap.page_allocator) catch unreachable;
defer std.process.argsFree(std.heap.page_allocator, args);
if (args.len < 2) { std.debug.print("Usage: nbody <n>\n", .{}); std.process.exit(1); }
return .{ .n = std.fmt.parseInt(u32, args[1], 10) catch unreachable };
}
~proc apply {
for (bodies, 0..) |*b, i| {
b.vx += dv[i][0];
b.vy += dv[i][1];
b.vz += dv[i][2];
b.x += dt * b.vx;
b.y += dt * b.vy;
b.z += dt * b.vz;
}
}

parse does I/O (reading args). apply mutates state (updating bodies). These are explicitly imperative—the ~proc keyword signals “here be side effects.”
The rest of the program is pure subflows. You can see the boundary.
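One detail of apply worth calling out: velocities update first, and positions then advance with the already-updated velocities. That ordering is symplectic Euler, which keeps long-run energy drift bounded. A Python sketch of the same step (illustrative names and values):

```python
def apply(bodies, dt, dv):
    # Kick velocities first, then drift positions using the NEW
    # velocities -- the same order as the Zig proc above.
    for b, (dvx, dvy, dvz) in zip(bodies, dv):
        b["vx"] += dvx
        b["vy"] += dvy
        b["vz"] += dvz
        b["x"] += dt * b["vx"]
        b["y"] += dt * b["vy"]
        b["z"] += dt * b["vz"]

body = {"x": 0.0, "y": 0.0, "z": 0.0, "vx": 1.0, "vy": 0.0, "vz": 0.0}
apply([body], dt=0.01, dv=[(0.5, 0.0, 0.0)])
print(body["vx"], body["x"])
```

The position moves by dt times the kicked velocity (1.5), not the old one (1.0); swapping the two loops’ order would change the integrator.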
What the Compiler Sees
Every subflow compiles to a simple function. pair_delta becomes:
fn pair_delta_handler(input: PairDeltaInput) PairDeltaOutput {
const d = .{ .dx = input.bi.x - input.bj.x, ... };
return .{ .delta = .{
.dx = d.dx,
.dy = d.dy,
.dz = d.dz,
.dsq = d.dx*d.dx + d.dy*d.dy + d.dz*d.dz,
...
}};
}

The captures compile to loops with accumulators. The pipelines compile to function calls. There’s no magic runtime, no boxing, no dynamic dispatch.
The abstractions disappear. What remains is the code you would have written by hand—but structured in a way that’s easier to read, easier to reason about, and harder to get wrong.
Performance
Koru compiles to Zig. So the real test of zero-cost abstractions is: does our high-level code run as fast as hand-written Zig?
We benchmarked against idiomatic implementations in six languages. Same algorithm. Same output precision. 50 million iterations.
The Results
Language                 Time      vs Koru
──────────────────────────────────────────
Rust                     1.35s     0.97x
Koru (pure subflows)     1.39s     1.00x
Zig (hand-written)       1.41s     1.01x
Go                       1.88s     1.35x
C (gcc -O3)              1.97s     1.42x
Python                   216.7s    156x

The Key Comparison: Koru vs Zig
Koru and hand-written Zig are identical. In this run, Koru was actually 2% faster—but that’s noise. Run it again and Zig might win. The point is they trade places depending on system load, cache state, and cosmic rays.
This is what zero-cost abstraction means. Not “pretty close.” Not “acceptable overhead.” Identical. The captures, the subflows, the pipelines, the branch constructors—they compile to exactly the code you would have written by hand.
Wait, Koru Beat C?
Yes. And this deserves explanation.
The C implementation uses gcc -O3 -march=native. It’s the reference implementation from the Benchmarks Game. It’s idiomatic C—straightforward loops, no SIMD intrinsics, no manual unrolling.
Zig and Rust both use LLVM, which has had decades of optimization work. Modern LLVM generates better code than GCC for this particular workload. The C code isn’t slow—LLVM is just very, very good.
The point isn’t “Koru beats C.” The point is: Koru’s abstractions add zero overhead on top of what Zig/LLVM produces.
Idiomatic Code Matters
Every implementation here is idiomatic for its language:
- Python: Lists, for loops, math.sqrt. No NumPy, no Cython.
- Go: Slices, range loops, math.Sqrt. No unsafe, no assembly.
- C: Arrays, while loops, pointer arithmetic. Standard -O3.
- Zig: Slices, for loops, @sqrt. Standard ReleaseFast.
- Rust: Iterators, ranges, f64::sqrt. Standard --release.
- Koru: Events, captures, subflows. The whole point of the language.
We’re not comparing “optimized C with SIMD” against “naive Python.” We’re comparing what a competent developer would write in each language when solving the same problem.
Python is 156x slower than Koru. That’s the cost of interpretation and dynamic typing. Go is 35% slower—that’s GC overhead and less aggressive optimization. C is 42% slower than Rust/Zig/Koru—that’s GCC vs LLVM.
And Koru? Identical to Zig. The abstractions are free.
What We Learned
Koru lets you write code that:
- Reads like a specification - The main flow is the algorithm, not implementation details
- Isolates mutation - Pure subflows for computation, explicit procs for state changes
- Composes cleanly - Events pipeline into events, captures chain into captures
- Runs fast - The abstractions compile to exactly what you’d write by hand
You don’t have to choose between clarity and performance. The N-body simulation proves it: declarative, functional-style code that matches hand-tuned Rust.
Run It Yourself
The complete implementation is in the Koru repository:
tests/regression/900_EXAMPLES_SHOWCASE/910_LANGUAGE_SHOOTOUT/2101g_nbody_subflow/

Build and benchmark:
koru build input.kz -o nbody
hyperfine './nbody 50000000' './nbody_rust 50000000'