Emulation as Progressive Optimization
ASPIRATIONAL — Will turn into a roadmap
The traditional emulation trilemma: fast OR correct OR maintainable - pick two.
- Static recompilation is fast but brittle (breaks on self-modifying code)
- Interpretation is correct but slow (10-100x overhead)
- JIT is complex to build (months of infrastructure work)
What if we could have all three? What if we could start with a correct interpreter and progressively optimize it to native speed, mixing hand-written code and compiler passes freely, all while proving correctness through property tests?
The Progressive Optimization Approach
Instead of choosing upfront, we build optimization progressively:
Week 1: Interpreter - Slow but correct
Week 2: Profile - Find the 20% that takes 80% of time
Week 3: Optimize - Hand-write OR generate fast paths
Week 4: Validate - Prove equivalence, measure speedup
Week 5: Iterate - Repeat until native speed
The key insight: compiler coding IS application coding. There’s no boundary between “writing the emulator” and “writing compiler passes that optimize the emulator.” It’s all just Koru code making Koru code faster.
Example: Super Mario Bros Emulator
Let me show you what this looks like in practice.
Week 1: Pure Interpreter (5 FPS)
// Start with naive interpreter
~event cpu_execute { cycles: u32 }
| done { cycles_executed: u32 }
~proc cpu_execute {
var cycles_left = cycles;
while (cycles_left > 0) {
const opcode = readMemory(state.pc);
// Interpret instruction byte-by-byte
const cycles_used = executeInstruction(opcode, &state);
cycles_left -= cycles_used;
}
return .{ .done = .{ .cycles_executed = cycles } };
}
~event ppu_render { scanline: u8 }
| rendered { pixels: []u8 }
~proc ppu_render {
// Naive pixel-by-pixel rendering
for (0..240) |y| { // 240 visible scanlines
for (0..256) |x| { // 256 pixels per scanline
pixels[y][x] = computePixel(x, y, &ppu_state);
}
}
return .{ .rendered = .{ .pixels = pixels } };
}
Result: Functional emulator, runs at ~5 FPS. Slow but correct.
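For readers who have not written a 6502 interpreter before, here is a minimal sketch in plain Zig of the kind of dispatch executeInstruction hides: decode one opcode, mutate the CPU state, report the cycles consumed. The CpuState layout and the two opcodes shown are illustrative assumptions, not part of the emulator above.
const std = @import("std");

const CpuState = struct {
    a: u8 = 0,            // accumulator
    x: u8 = 0,            // X index register
    pc: u16 = 0,          // program counter
    mem: [0x10000]u8 = [_]u8{0} ** 0x10000, // flat 64 KiB address space

    fn readMemory(self: *const CpuState, addr: u16) u8 {
        return self.mem[addr];
    }
};

fn executeInstruction(opcode: u8, state: *CpuState) u32 {
    switch (opcode) {
        0xA9 => { // LDA #imm - load the accumulator with the immediate operand
            state.a = state.readMemory(state.pc +% 1);
            state.pc +%= 2;
            return 2; // cycles consumed
        },
        0xE8 => { // INX - increment X, wrapping at 0xFF
            state.x +%= 1;
            state.pc +%= 1;
            return 2;
        },
        else => { // every other opcode is treated as a NOP in this sketch
            state.pc +%= 1;
            return 2;
        },
    }
}

test "LDA #imm loads the accumulator" {
    var cpu = CpuState{};
    cpu.mem[0] = 0xA9; // LDA #$42
    cpu.mem[1] = 0x42;
    _ = executeInstruction(cpu.readMemory(cpu.pc), &cpu);
    try std.testing.expectEqual(@as(u8, 0x42), cpu.a);
}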
Week 2: Profile (Find Hot Spots)
koruc --profile nes_emulator.kz
{
"hotspots": [
{
"call_site": "ppu_render:a3k8df2m9vx4",
"flow": "nes_frame",
"calls": 60,
"avg_ms": 137,
"total_ms": 8220,
"percent": 82.2,
"recommendation": "This dominates runtime - optimize first"
},
{
"call_site": "cpu_execute:c7m4kl8p9zh3",
"flow": "nes_frame",
"calls": 60,
"avg_ms": 15,
"total_ms": 900,
"percent": 9.0
}
]
}
Insight: PPU rendering is 82% of runtime. Optimize that first!
Week 3: Hand-Write Native Variant (30 FPS)
PPU rendering is well-understood. Just write it cleanly in Zig:
// Create native variant - hand-written optimization
~proc ppu_render|zig(native) {
// Clean, optimized Zig code
const bg_tiles = fetchBackgroundTiles(&ppu);
const sprite_data = fetchSpriteData(&ppu);
// Batch render with SIMD when possible
renderBackgroundBatch(pixels, bg_tiles);
renderSpritesBatch(pixels, sprite_data);
return .{ .rendered = .{ .pixels = pixels } };
}
Property test to prove equivalence:
test "ppu_render variants equivalent" {
for (test_cases) |case| {
const result_baseline = ppu_render|zig(baseline)(case.input);
const result_native = ppu_render|zig(native)(case.input);
// Pixels must match exactly
try expectEqualPixels(result_baseline.pixels, result_native.pixels);
}
}
Register in build manifest:
// app.build.zon - Tracks which variant to use where
{
.call_sites = {
.ppu_render_a3k8df2m9vx4 = {
.variant = "native",
.notes = "Hand-written, 18x faster than baseline",
},
},
.variants = {
.ppu_render = {
.baseline = "nes/ppu_render.kz",
.native = "nes/optimized/ppu_render_native.kz",
},
},
}
Result: 30 FPS! Native variant runs 18x faster, proven equivalent.
Week 4: Compiler Pass for Pattern (45 FPS)
Many 6502 instruction sequences follow the same pattern: LDA addr / ADC #imm / STA addr (load, add, store). Write a compiler pass and hook it into the compilation pipeline:
~import "$std/compiler"
// Your custom optimization pass
~[comptime]event optimize_6502_arithmetic { ctx: compiler.CompilerContext }
| optimized { ctx: compiler.CompilerContext }
~[comptime]proc optimize_6502_arithmetic {
// Find sequences: load -> arithmetic -> store
for (ctx.ast.procs) |proc| {
if (matchesArithmeticPattern(proc)) {
// Generate optimized variant
const optimized = generateNativeArithmetic(proc);
addVariant(proc, "native", optimized);
}
}
var updated_ctx = ctx;
updated_ctx.passes_completed += 1;
return .{ .optimized = .{ .ctx = updated_ctx } };
}
// Override the compiler pipeline to include your pass!
~compiler.coordinate =
compiler.context.create(ast: ast, allocator: allocator)
| created c0 |> compiler.coordinate.frontend(ctx: c0.ctx)
| continued c1 |> optimize_6502_arithmetic(ctx: c1.ctx) // ← Your pass here!
| optimized c2 |> compiler.coordinate.analysis(ctx: c2.ctx)
| continued c3 |> compiler.coordinate.emission(ctx: c3.ctx)
| continued c4 |> coordinated {
ast: c4.ctx.ast,
code: c4.code,
metrics: "Custom 6502 optimization pipeline"
}
Result: 45 FPS! Compiler pass optimized 50+ instructions automatically.
The beautiful part? This is just Koru code. Your optimization pass is a regular event that runs during compilation. No special compiler hooks, no plugins, no external tools - just events and flows.
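For intuition, here is a hedged sketch in plain Zig of the kind of code such a pass might emit for the LDA addr / ADC #imm / STA addr pattern: the three interpreted steps collapse into one fused load-add-store. The function name and the simplified flag handling (only carry, no overflow or negative flags) are assumptions for illustration.
const std = @import("std");

// Illustrative only: one fused operation replacing three dispatched instructions.
fn fusedLdaAdcSta(mem: []u8, addr: u16, imm: u8, carry_in: bool, carry_out: *bool) void {
    const a: u16 = mem[addr];                            // LDA addr
    const sum = a + imm + @intFromBool(carry_in);        // ADC #imm
    carry_out.* = sum > 0xFF;                            // update the carry flag
    mem[addr] = @truncate(sum);                          // STA addr
}

test "fused load-add-store matches the three-step semantics" {
    var mem = [_]u8{0} ** 256;
    mem[0x10] = 250;
    var carry = false;
    fusedLdaAdcSta(&mem, 0x10, 10, false, &carry);
    try std.testing.expectEqual(@as(u8, 4), mem[0x10]); // 250 + 10 wraps to 4
    try std.testing.expect(carry);
}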
Week 5: Hand-Write Physics (55 FPS)
Mario’s jump physics is simple - easier to hand-write than to generate:
// Hand-written - simpler than compiler pass for this!
~proc mario_physics|zig(native) {
if (input.a_button and mario.on_ground) {
mario.y_velocity = -12; // Jump!
}
// Apply gravity
mario.y_velocity += 1;
mario.y += mario.y_velocity;
// Ground collision
if (mario.y >= GROUND_LEVEL) {
mario.y = GROUND_LEVEL;
mario.y_velocity = 0;
mario.on_ground = true;
}
return .{ .updated = .{ .state = mario } };
}
Result: 55 FPS! Physics now runs at native speed.
Week 6: Fuse Memory Operations (60 FPS!)
Write one more pass to combine multiple memory reads/writes:
~[comptime]proc fuse_memory_ops(ast: *AST) {
// Find: read(addr1), read(addr2), read(addr3)
// Generate: read_batch([addr1, addr2, addr3])
for (ast.hotPaths()) |path| {
const reads = findConsecutiveReads(path);
if (reads.len >= 3) {
generateBatchRead(reads);
}
}
}
Result: 60 FPS - Full speed! Native-quality emulation.
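As a rough illustration of the fusion idea, here is a sketch in plain Zig of what a read_batch-style helper could look like: one loop over a list of addresses instead of three separately dispatched reads. The helper name and signature are hypothetical; a real NES memory map would also have to handle mirroring and MMIO.
const std = @import("std");

// Sketch only: the shape of a batched read that fuse_memory_ops might substitute.
fn readBatch(mem: []const u8, addrs: []const u16, out: []u8) void {
    for (addrs, out) |addr, *dst| {
        dst.* = mem[addr]; // one loop instead of three dispatched read() calls
    }
}

test "batched read returns the same bytes as individual reads" {
    var mem = [_]u8{0} ** 64;
    mem[1] = 0xAA;
    mem[2] = 0xBB;
    mem[3] = 0xCC;
    var out: [3]u8 = undefined;
    readBatch(&mem, &[_]u16{ 1, 2, 3 }, &out);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xAA, 0xBB, 0xCC }, &out);
}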
The Optimization Manifest
All optimizations tracked in app.build.zon:
{
.call_sites = {
// Week 3: Hand-written PPU rendering
.ppu_render_a3k8df2m9vx4 = {
.variant = "native",
.speedup = "18.2x",
.generated_by = "hand_written",
.date = "2025-10-22",
},
// Week 4: Compiler pass for arithmetic
.cpu_adc_c7m4kl8p9zh3 = {
.variant = "native",
.speedup = "3.2x",
.generated_by = "compiler_pass:optimize_6502_arithmetic",
.date = "2025-10-24",
},
// Week 5: Hand-written physics
.mario_physics_h3k9df8a2xv4 = {
.variant = "native",
.speedup = "4.2x",
.generated_by = "hand_written",
.date = "2025-10-26",
.notes = "Simple enough to hand-write, faster than generating",
},
},
.meta = {
.total_call_sites = 247,
.optimized_call_sites = 89,
.coverage = "36%",
.cumulative_speedup = "12x",
},
}
The Key Insight: Two Paths to Optimization
When you hit a hot spot, you have TWO options:
Option 1: Hand-Write a Variant
When: The optimization is simple and obvious
Example: Mario physics is just arithmetic
~proc mario_jump|zig(native) {
// 5 minutes to write, immediately clear
if (input.a and on_ground) {
velocity_y = -12;
}
}
Option 2: Write a Compiler Pass
When: You see a pattern that occurs many times
Example: 50 6502 arithmetic instructions all follow the same pattern
~[comptime]proc optimize_arithmetic(ast: *AST) {
// 2 hours to write once,
// optimizes 50 functions automatically
for (ast.procs) |proc| {
if (matchesPattern(proc)) {
generateOptimized(proc);
}
}
}
The beautiful part: Both are just Koru code. Both property tested. Both tracked in manifest. Compiler coding IS application coding.
Call-Site GeoHashing: Refactor-Resistant Tracking
How do we track “which call site” across refactorings? Structural hashing.
Each call site gets a 12-character hash based on its context:
ppu_render:a3k8df2m9vx4
Chars 0-1: a3 = Module context
Chars 2-3: k8 = Flow context
Chars 4-5: df = Position in flow
Chars 6-8: 2m9 = Surrounding calls
Chars 9-11: vx4 = Invocation specifics
After refactoring:
koruc --check-manifest app.build.zon nes_emulator.kz
✓ 85/89 call sites matched exactly
⚠ 4 call sites drifted (high similarity)
ppu_render:a3k8df2m9vx4 → a3k8df2m9vu8
Similarity: ▓▓▓▓▓▓▓▓▓▓▓░ (11/12 chars)
Confidence: 99.9% - Auto-update recommended
The hash is calculated from the AST (no storage needed) and survives refactoring (structural, not positional).
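To make "structural, not positional" concrete, here is a hedged sketch in plain Zig of hashing a call site from structural features only. The CallSiteContext fields and the use of Wyhash are illustrative assumptions; the real koruc hash layout (12 characters with per-segment meaning) is not reproduced here.
const std = @import("std");

const CallSiteContext = struct {
    module: []const u8,            // e.g. "nes"
    flow: []const u8,              // e.g. "nes_frame"
    callee: []const u8,            // e.g. "ppu_render"
    neighbors: []const []const u8, // surrounding calls in the same flow
};

// Hash only structural features, never file names or line numbers.
fn structuralHash(site: CallSiteContext) u64 {
    var h = std.hash.Wyhash.init(0);
    h.update(site.module);
    h.update(site.flow);
    h.update(site.callee);
    for (site.neighbors) |n| h.update(n);
    return h.final();
}

test "identical structure hashes identically; structural drift changes it" {
    const original = CallSiteContext{
        .module = "nes",
        .flow = "nes_frame",
        .callee = "ppu_render",
        .neighbors = &.{ "cpu_execute", "present_frame" },
    };
    const moved = original; // moved to another file, line numbers changed: same structure
    const drifted = CallSiteContext{
        .module = "nes",
        .flow = "nes_frame",
        .callee = "ppu_render",
        .neighbors = &.{ "cpu_execute", "apu_tick" }, // a neighboring call changed
    };
    try std.testing.expectEqual(structuralHash(original), structuralHash(moved));
    try std.testing.expect(structuralHash(original) != structuralHash(drifted));
}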
Property Testing: The Truth Machine
Every optimization must prove equivalence:
test "cpu_adc variants equivalent" {
for (random_test_cases(10000)) |case| {
const baseline = cpu_adc|zig(baseline)(case);
const native = cpu_adc|zig(native)(case);
// Must produce identical CPU state
try expectEqual(baseline.a_register, native.a_register);
try expectEqual(baseline.flags, native.flags);
try expectEqual(baseline.cycles, native.cycles);
}
}
The baseline is sacred - never deleted, always available for testing.
If property tests fail → Optimization rejected, baseline untouched. If property tests pass → Optimization proven correct, deploy with confidence.
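As a sketch of where those random test cases might come from, here is one way to generate random CPU inputs in plain Zig. The CpuCase fields and the randomTestCase helper are hypothetical; they simply mirror the registers the property test above compares.
const std = @import("std");

const CpuCase = struct {
    a_register: u8, // accumulator before the ADC
    operand: u8,    // immediate operand
    carry: bool,    // incoming carry flag
};

fn randomTestCase(rand: std.Random) CpuCase {
    return .{
        .a_register = rand.int(u8),
        .operand = rand.int(u8),
        .carry = rand.boolean(),
    };
}

test "the generator eventually exercises the carry-overflow region" {
    var prng = std.Random.DefaultPrng.init(0xdecafbad);
    const rand = prng.random();
    var saw_overflowing_case = false;
    for (0..10_000) |_| {
        const case = randomTestCase(rand);
        const sum: u16 = @as(u16, case.a_register) + case.operand + @intFromBool(case.carry);
        if (sum > 0xFF) saw_overflowing_case = true;
    }
    try std.testing.expect(saw_overflowing_case);
}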
The Pragmatic Approach to Hard Problems
Self-modifying code and computed jumps deterred us before. Not anymore!
Strategy: Detect + Fallback
// Runtime tap detects writes to code space
~[runtime]tap(memory_write -> code_modification_detector)
| Profile write |> {
if (isCodePage(addr)) {
invalidateCompiledCode(addr);
markAsInterpreted(addr);
}
}
Common case: Most code doesn't self-modify → Compile to native
Edge case: Self-modifying code detected → Fall back to interpreter
Result: Fast where possible, correct everywhere
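Here is a hedged sketch in plain Zig of the bookkeeping that detect-and-fallback implies: track an execution mode per 256-byte page, demote a page to interpreted when the tap sees a write into it, and consult that mode before dispatching. The page granularity and type names are illustrative, not Koru runtime API.
const std = @import("std");

const ExecMode = enum { native, interpreted };

const CodePages = struct {
    modes: [256]ExecMode = [_]ExecMode{.native} ** 256,

    fn pageOf(addr: u16) usize {
        return addr >> 8; // 256-byte pages
    }

    // Called from the memory_write tap when the target is a code page.
    fn onCodeWrite(self: *CodePages, addr: u16) void {
        self.modes[pageOf(addr)] = .interpreted; // invalidate compiled code for this page
    }

    // Consulted before executing: native fast path or interpreter fallback.
    fn modeFor(self: *const CodePages, pc: u16) ExecMode {
        return self.modes[pageOf(pc)];
    }
};

test "a write into a code page falls back to the interpreter" {
    var pages = CodePages{};
    try std.testing.expectEqual(ExecMode.native, pages.modeFor(0x8010));
    pages.onCodeWrite(0x8010); // self-modifying write detected
    try std.testing.expectEqual(ExecMode.interpreted, pages.modeFor(0x8010));
}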
Per-Call-Site Optimization
~flow nes_frame {
cpu_execute(cycles: 29780) // Hot: compile to native
|> render_frame()
|> cpu_execute(cycles: 100) // Rare: keep interpreted
}
Same event, different call sites, different strategies!
{
.cpu_execute_c7m4kl8p9zh3 = {
.variant = "jit", // Frame loop: optimize aggressively
},
.cpu_execute_c7m4kl8p9zh8 = {
.variant = "baseline", // Menu code: rare, keep simple
},
}
Why This Wasn't Possible Before
Traditional languages force a choice:
C/C++: Compiler is external, can't extend per-project
Rust: Macros help but can't touch the optimizer
Python: Can instrument but can't compile to native
LLVM: Can write passes but they're separate from the app
Koru: The compiler IS your project. Extending it IS coding.
The Vision: Every Architecture, Native Speed
This approach works for ANY architecture:
- 6502 → NES, Commodore 64, Apple II
- Z80 → Game Boy, Sega Master System
- ARM7 → Game Boy Advance, Nintendo DS
- DSP56300 → Synthesizers (Osiris, Virus)
- x86 → Legacy software preservation
- WASM → Web apps compiled to native
Start with interpreter (correct), progressively optimize (fast), property test everything (proven).
What This Enables
Preservation - Old hardware never dies
Performance - Native-speed emulation
Research - Experiment with architectures
Security - Sandboxed legacy code
Education - Understand hardware by reimplementing it
Compiler Coding IS Application Coding
The breakthrough insight:
There’s no boundary between:
- Writing the emulator (application code)
- Writing optimizations (hand-written variants)
- Writing compiler passes (code generation)
It’s all just Koru code making Koru code faster.
When you hit a hot spot:
- Is it simple? → Hand-write a variant (5 minutes)
- Is it a pattern? → Write a compiler pass (2 hours, optimizes 50 functions)
- Is it complex? → Keep it interpreted (fallback)
All three are valid. All three are Koru code. Mix freely.
The Boundless Emulation Future
We’re not just building better emulators. We’re building a world where any executable artifact can be transparently translated, optimized, and run on any hardware.
Preservation meets performance. Correctness meets speed. All through progressive optimization.
This is what Koru makes possible.
Want to learn more? Check out Call-Site GeoHashing for the technical details, or Building an Emulator in Koru for a step-by-step tutorial.