Emulation as Progressive Optimization
ASPIRATIONAL — Will turn into a roadmap
The traditional emulation trilemma: fast OR correct OR maintainable - pick two.
- Static recompilation is fast but brittle (breaks on self-modifying code)
- Interpretation is correct but slow (10-100x overhead)
- JIT is complex to build (months of infrastructure work)
What if we could have all three? What if we could start with a correct interpreter and progressively optimize it to native speed, mixing hand-written code and compiler passes freely, all while proving correctness through property tests?
The Progressive Optimization Approach
Instead of choosing upfront, we build optimization progressively:
Week 1: Interpreter - Slow but correct
Week 2: Profile - Find the 20% that takes 80% of time
Week 3: Optimize - Hand-write OR generate fast paths
Week 4: Validate - Prove equivalence, measure speedup
Week 5: Iterate - Repeat until native speed
The key insight: compiler coding IS application coding. There’s no boundary between “writing the emulator” and “writing compiler passes that optimize the emulator.” It’s all just Koru code making Koru code faster.
Example: Super Mario Bros Emulator
Let me show you what this looks like in practice.
Week 1: Pure Interpreter (5 FPS)
// Start with naive interpreter
~event cpu_execute { cycles: u32 }
| done { cycles_executed: u32 }
~proc cpu_execute {
var cycles_left = cycles;
while (cycles_left > 0) {
const opcode = readMemory(state.pc);
// Interpret instruction byte-by-byte
const cycles_used = executeInstruction(opcode, &state);
cycles_left -= cycles_used;
}
return .{ .done = .{ .cycles_executed = cycles } };
}
~event ppu_render { scanline: u8 }
| rendered { pixels: []u8 }
~proc ppu_render {
// Naive pixel-by-pixel rendering
for (0..240) |y| { // 240 visible scanlines
for (0..256) |x| { // 256 pixels per scanline
pixels[y][x] = computePixel(x, y, &ppu_state);
}
}
return .{ .rendered = .{ .pixels = pixels } };
}
Result: Functional emulator, runs at ~5 FPS. Slow but correct.
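For readers who have not written a 6502 interpreter before, here is a minimal sketch in plain Zig of the kind of dispatch executeInstruction hides: decode one opcode, mutate the CPU state, report the cycles consumed. The CpuState layout and the two opcodes shown are illustrative assumptions, not part of the emulator above.
const std = @import("std");

const CpuState = struct {
    a: u8 = 0,            // accumulator
    x: u8 = 0,            // X index register
    pc: u16 = 0,          // program counter
    mem: [0x10000]u8 = [_]u8{0} ** 0x10000, // flat 64 KiB address space

    fn readMemory(self: *const CpuState, addr: u16) u8 {
        return self.mem[addr];
    }
};

fn executeInstruction(opcode: u8, state: *CpuState) u32 {
    switch (opcode) {
        0xA9 => { // LDA #imm - load the accumulator with the immediate operand
            state.a = state.readMemory(state.pc +% 1);
            state.pc +%= 2;
            return 2; // cycles consumed
        },
        0xE8 => { // INX - increment X, wrapping at 0xFF
            state.x +%= 1;
            state.pc +%= 1;
            return 2;
        },
        else => { // every other opcode is treated as a NOP in this sketch
            state.pc +%= 1;
            return 2;
        },
    }
}

test "LDA #imm loads the accumulator" {
    var cpu = CpuState{};
    cpu.mem[0] = 0xA9; // LDA #$42
    cpu.mem[1] = 0x42;
    _ = executeInstruction(cpu.readMemory(cpu.pc), &cpu);
    try std.testing.expectEqual(@as(u8, 0x42), cpu.a);
}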
Week 2: Profile (Find Hot Spots)
koruc --profile nes_emulator.kz
{
"hotspots": [
{
"call_site": "ppu_render:a3k8df2m9vx4",
"flow": "nes_frame",
"calls": 60,
"avg_ms": 137,
"total_ms": 8220,
"percent": 82.2,
"recommendation": "This dominates runtime - optimize first"
},
{
"call_site": "cpu_execute:c7m4kl8p9zh3",
"flow": "nes_frame",
"calls": 60,
"avg_ms": 15,
"total_ms": 900,
"percent": 9.0
}
]
}
Insight: PPU rendering is 82% of runtime. Optimize that first!
Week 3: Hand-Write Native Variant (30 FPS)
PPU rendering is well-understood. Just write it cleanly in Zig:
// Create native variant - hand-written optimization
~proc ppu_render|zig(native) {
// Clean, optimized Zig code
const bg_tiles = fetchBackgroundTiles(&ppu);
const sprite_data = fetchSpriteData(&ppu);
// Batch render with SIMD when possible
renderBackgroundBatch(pixels, bg_tiles);
renderSpritesBatch(pixels, sprite_data);
return .{ .rendered = .{ .pixels = pixels } };
}
Property test to prove equivalence:
test "ppu_render variants equivalent" {
for (test_cases) |case| {
const result_baseline = ppu_render|zig(baseline)(case.input);
const result_native = ppu_render|zig(native)(case.input);
// Pixels must match exactly
try expectEqualPixels(result_baseline.pixels, result_native.pixels);
}
}
Register in build manifest:
// app.build.zon - Tracks which variant to use where
{
.call_sites = {
.ppu_render_a3k8df2m9vx4 = {
.variant = "native",
.notes = "Hand-written, 18x faster than baseline",
},
},
.variants = {
.ppu_render = {
.baseline = "nes/ppu_render.kz",
.native = "nes/optimized/ppu_render_native.kz",
},
},
}
Result: 30 FPS! Native variant runs 18x faster, proven equivalent.
Week 4: Compiler Pass for Pattern (45 FPS)
Many 6502 instruction sequences follow the same pattern: LDA addr / ADC #imm / STA addr (load, add, store). Write a compiler pass and hook it into the compilation pipeline:
~import "$std/compiler"
// Your custom optimization pass
~[comptime]event optimize_6502_arithmetic { ctx: compiler.CompilerContext }
| optimized { ctx: compiler.CompilerContext }
~[comptime]proc optimize_6502_arithmetic {
// Find sequences: load -> arithmetic -> store
for (ctx.ast.procs) |proc| {
if (matchesArithmeticPattern(proc)) {
// Generate optimized variant
const optimized = generateNativeArithmetic(proc);
addVariant(proc, "native", optimized);
}
}
var updated_ctx = ctx;
updated_ctx.passes_completed += 1;
return .{ .optimized = .{ .ctx = updated_ctx } };
}
// Override the compiler pipeline to include your pass!
~compiler.coordinate =
compiler.context.create(ast: ast, allocator: allocator)
| created c0 |> compiler.coordinate.frontend(ctx: c0.ctx)
| continued c1 |> optimize_6502_arithmetic(ctx: c1.ctx) // ← Your pass here!
| optimized c2 |> compiler.coordinate.analysis(ctx: c2.ctx)
| continued c3 |> compiler.coordinate.emission(ctx: c3.ctx)
| continued c4 |> coordinated {
ast: c4.ctx.ast,
code: c4.code,
metrics: "Custom 6502 optimization pipeline"
}
Result: 45 FPS! Compiler pass optimized 50+ instructions automatically.
The beautiful part? This is just Koru code. Your optimization pass is a regular event that runs during compilation. No special compiler hooks, no plugins, no external tools - just events and flows.
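For intuition, here is a hedged sketch in plain Zig of the kind of code such a pass might emit for the LDA addr / ADC #imm / STA addr pattern: the three interpreted steps collapse into one fused load-add-store. The function name and the simplified flag handling (only carry, no overflow or negative flags) are assumptions for illustration.
const std = @import("std");

// Illustrative only: one fused operation replacing three dispatched instructions.
fn fusedLdaAdcSta(mem: []u8, addr: u16, imm: u8, carry_in: bool, carry_out: *bool) void {
    const a: u16 = mem[addr];                            // LDA addr
    const sum = a + imm + @intFromBool(carry_in);        // ADC #imm
    carry_out.* = sum > 0xFF;                            // update the carry flag
    mem[addr] = @truncate(sum);                          // STA addr
}

test "fused load-add-store matches the three-step semantics" {
    var mem = [_]u8{0} ** 256;
    mem[0x10] = 250;
    var carry = false;
    fusedLdaAdcSta(&mem, 0x10, 10, false, &carry);
    try std.testing.expectEqual(@as(u8, 4), mem[0x10]); // 250 + 10 wraps to 4
    try std.testing.expect(carry);
}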
Week 5: Hand-Write Physics (55 FPS)
Mario’s jump physics is simple - easier to hand-write than to generate:
// Hand-written - simpler than compiler pass for this!
~proc mario_physics|zig(native) {
if (input.a_button and mario.on_ground) {
mario.y_velocity = -12; // Jump!
}
// Apply gravity
mario.y_velocity += 1;
mario.y += mario.y_velocity;
// Ground collision
if (mario.y >= GROUND_LEVEL) {
mario.y = GROUND_LEVEL;
mario.y_velocity = 0;
mario.on_ground = true;
}
return .{ .updated = .{ .state = mario } };
}
Result: 55 FPS! Physics now runs at native speed.
Week 6: Fuse Memory Operations (60 FPS!)
Write one more pass to combine multiple memory reads/writes:
~[comptime]proc fuse_memory_ops(ast: *AST) {
// Find: read(addr1), read(addr2), read(addr3)
// Generate: read_batch([addr1, addr2, addr3])
for (ast.hotPaths()) |path| {
const reads = findConsecutiveReads(path);
if (reads.len >= 3) {
generateBatchRead(reads);
}
}
}
Result: 60 FPS - Full speed! Native-quality emulation.
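As a rough illustration of the fusion idea, here is a sketch in plain Zig of what a read_batch-style helper could look like: one loop over a list of addresses instead of three separately dispatched reads. The helper name and signature are hypothetical; a real NES memory map would also have to handle mirroring and MMIO.
const std = @import("std");

// Sketch only: the shape of a batched read that fuse_memory_ops might substitute.
fn readBatch(mem: []const u8, addrs: []const u16, out: []u8) void {
    for (addrs, out) |addr, *dst| {
        dst.* = mem[addr]; // one loop instead of three dispatched read() calls
    }
}

test "batched read returns the same bytes as individual reads" {
    var mem = [_]u8{0} ** 64;
    mem[1] = 0xAA;
    mem[2] = 0xBB;
    mem[3] = 0xCC;
    var out: [3]u8 = undefined;
    readBatch(&mem, &[_]u16{ 1, 2, 3 }, &out);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xAA, 0xBB, 0xCC }, &out);
}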
The Optimization Manifest
All optimizations tracked in app.build.zon:
{
.call_sites = {
// Week 3: Hand-written PPU rendering
.ppu_render_a3k8df2m9vx4 = {
.variant = "native",
.speedup = "18.2x",
.generated_by = "hand_written",
.date = "2025-10-22",
},
// Week 4: Compiler pass for arithmetic
.cpu_adc_c7m4kl8p9zh3 = {
.variant = "native",
.speedup = "3.2x",
.generated_by = "compiler_pass:optimize_6502_arithmetic",
.date = "2025-10-24",
},
// Week 5: Hand-written physics
.mario_physics_h3k9df8a2xv4 = {
.variant = "native",
.speedup = "4.2x",
.generated_by = "hand_written",
.date = "2025-10-26",
.notes = "Simple enough to hand-write, faster than generating",
},
},
.meta = {
.total_call_sites = 247,
.optimized_call_sites = 89,
.coverage = "36%",
.cumulative_speedup = "12x",
},
}
The Key Insight: Two Paths to Optimization
When you hit a hot spot, you have TWO options:
Option 1: Hand-Write a Variant
When: The optimization is simple and obvious
Example: Mario physics is just arithmetic
~proc mario_jump|zig(native) {
// 5 minutes to write, immediately clear
if (input.a and on_ground) {
velocity_y = -12;
}
}
Option 2: Write a Compiler Pass
When: You see a pattern that occurs many times
Example: 50 6502 arithmetic instructions all follow the same pattern
~[comptime]proc optimize_arithmetic(ast: *AST) {
// 2 hours to write once,
// optimizes 50 functions automatically
for (ast.procs) |proc| {
if (matchesPattern(proc)) {
generateOptimized(proc);
}
}
}
The beautiful part: Both are just Koru code. Both property tested. Both tracked in manifest. Compiler coding IS application coding.
Call-Site GeoHashing: Refactor-Resistant Tracking
How do we track “which call site” across refactorings? Structural hashing.
Each call site gets a 12-character hash based on its context:
ppu_render:a3k8df2m9vx4
Chars 0-1: a3 = Module context
Chars 2-3: k8 = Flow context
Chars 4-5: df = Position in flow
Chars 6-8: 2m9 = Surrounding calls
Chars 9-11: vx4 = Invocation specifics
After refactoring:
koruc --check-manifest app.build.zon nes_emulator.kz
✓ 85/89 call sites matched exactly
⚠ 4 call sites drifted (high similarity)
ppu_render:a3k8df2m9vx4 → a3k8df2m9vu8
Similarity: ▓▓▓▓▓▓▓▓▓▓▓░ (11/12 chars)
Confidence: 99.9% - Auto-update recommended
The hash is calculated from the AST (no storage needed) and survives refactoring (structural, not positional).
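To make "structural, not positional" concrete, here is a hedged sketch in plain Zig of hashing a call site from structural features only. The CallSiteContext fields and the use of Wyhash are illustrative assumptions; the real koruc hash layout (12 characters with per-segment meaning) is not reproduced here.
const std = @import("std");

const CallSiteContext = struct {
    module: []const u8,            // e.g. "nes"
    flow: []const u8,              // e.g. "nes_frame"
    callee: []const u8,            // e.g. "ppu_render"
    neighbors: []const []const u8, // surrounding calls in the same flow
};

// Hash only structural features, never file names or line numbers.
fn structuralHash(site: CallSiteContext) u64 {
    var h = std.hash.Wyhash.init(0);
    h.update(site.module);
    h.update(site.flow);
    h.update(site.callee);
    for (site.neighbors) |n| h.update(n);
    return h.final();
}

test "identical structure hashes identically; structural drift changes it" {
    const original = CallSiteContext{
        .module = "nes",
        .flow = "nes_frame",
        .callee = "ppu_render",
        .neighbors = &.{ "cpu_execute", "present_frame" },
    };
    const moved = original; // moved to another file, line numbers changed: same structure
    const drifted = CallSiteContext{
        .module = "nes",
        .flow = "nes_frame",
        .callee = "ppu_render",
        .neighbors = &.{ "cpu_execute", "apu_tick" }, // a neighboring call changed
    };
    try std.testing.expectEqual(structuralHash(original), structuralHash(moved));
    try std.testing.expect(structuralHash(original) != structuralHash(drifted));
}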
Property Testing: The Truth Machine
Every optimization must prove equivalence:
test "cpu_adc variants equivalent" {
for (random_test_cases(10000)) |case| {
const baseline = cpu_adc|zig(baseline)(case);
const native = cpu_adc|zig(native)(case);
// Must produce identical CPU state
try expectEqual(baseline.a_register, native.a_register);
try expectEqual(baseline.flags, native.flags);
try expectEqual(baseline.cycles, native.cycles);
}
}
The baseline is sacred - never deleted, always available for testing.
If property tests fail → Optimization rejected, baseline untouched. If property tests pass → Optimization proven correct, deploy with confidence.
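As a sketch of where those random test cases might come from, here is one way to generate random CPU inputs in plain Zig. The CpuCase fields and the randomTestCase helper are hypothetical; they simply mirror the registers the property test above compares.
const std = @import("std");

const CpuCase = struct {
    a_register: u8, // accumulator before the ADC
    operand: u8,    // immediate operand
    carry: bool,    // incoming carry flag
};

fn randomTestCase(rand: std.Random) CpuCase {
    return .{
        .a_register = rand.int(u8),
        .operand = rand.int(u8),
        .carry = rand.boolean(),
    };
}

test "the generator eventually exercises the carry-overflow region" {
    var prng = std.Random.DefaultPrng.init(0xdecafbad);
    const rand = prng.random();
    var saw_overflowing_case = false;
    for (0..10_000) |_| {
        const case = randomTestCase(rand);
        const sum: u16 = @as(u16, case.a_register) + case.operand + @intFromBool(case.carry);
        if (sum > 0xFF) saw_overflowing_case = true;
    }
    try std.testing.expect(saw_overflowing_case);
}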
The Pragmatic Approach to Hard Problems
Self-modifying code and computed jumps deterred us before. Not anymore!
Strategy: Detect + Fallback
// Runtime tap detects writes to code space
~[runtime]tap(memory_write -> code_modification_detector)
| Profile write |> {
if (isCodePage(addr)) {
invalidateCompiledCode(addr);
markAsInterpreted(addr);
}
}
Common case: Most code doesn't self-modify → Compile to native
Edge case: Self-modifying code detected → Fall back to interpreter
Result: Fast where possible, correct everywhere
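Here is a hedged sketch in plain Zig of the bookkeeping that detect-and-fallback implies: track an execution mode per 256-byte page, demote a page to interpreted when the tap sees a write into it, and consult that mode before dispatching. The page granularity and type names are illustrative, not Koru runtime API.
const std = @import("std");

const ExecMode = enum { native, interpreted };

const CodePages = struct {
    modes: [256]ExecMode = [_]ExecMode{.native} ** 256,

    fn pageOf(addr: u16) usize {
        return addr >> 8; // 256-byte pages
    }

    // Called from the memory_write tap when the target is a code page.
    fn onCodeWrite(self: *CodePages, addr: u16) void {
        self.modes[pageOf(addr)] = .interpreted; // invalidate compiled code for this page
    }

    // Consulted before executing: native fast path or interpreter fallback.
    fn modeFor(self: *const CodePages, pc: u16) ExecMode {
        return self.modes[pageOf(pc)];
    }
};

test "a write into a code page falls back to the interpreter" {
    var pages = CodePages{};
    try std.testing.expectEqual(ExecMode.native, pages.modeFor(0x8010));
    pages.onCodeWrite(0x8010); // self-modifying write detected
    try std.testing.expectEqual(ExecMode.interpreted, pages.modeFor(0x8010));
}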
Per-Call-Site Optimization
~flow nes_frame {
cpu_execute(cycles: 29780) // Hot: compile to native
|> render_frame()
|> cpu_execute(cycles: 100) // Rare: keep interpreted
}
Same event, different call sites, different strategies!
{
.cpu_execute_c7m4kl8p9zh3 = {
.variant = "jit", // Frame loop: optimize aggressively
},
.cpu_execute_c7m4kl8p9zh8 = {
.variant = "baseline", // Menu code: rare, keep simple
},
}
Why This Wasn't Possible Before
Traditional languages force a choice:
C/C++: Compiler is external, can't extend per-project
Rust: Macros help but can't touch the optimizer
Python: Can instrument but can't compile to native
LLVM: Can write passes but they're separate from the app
Koru: The compiler IS your project. Extending it IS coding.
The Vision: Every Architecture, Native Speed
This approach works for ANY architecture:
- 6502 → NES, Commodore 64, Apple II
- Z80 → Game Boy, Sega Master System
- ARM7 → Game Boy Advance, Nintendo DS
- DSP56300 → Synthesizers (Osiris, Virus)
- x86 → Legacy software preservation
- WASM → Web apps compiled to native
Start with interpreter (correct), progressively optimize (fast), property test everything (proven).
What This Enables
Preservation - Old hardware never dies
Performance - Native-speed emulation
Research - Experiment with architectures
Security - Sandboxed legacy code
Education - Understand hardware by reimplementing it
Compiler Coding IS Application Coding
The breakthrough insight:
There’s no boundary between:
- Writing the emulator (application code)
- Writing optimizations (hand-written variants)
- Writing compiler passes (code generation)
It’s all just Koru code making Koru code faster.
When you hit a hot spot:
- Is it simple? → Hand-write a variant (5 minutes)
- Is it a pattern? → Write a compiler pass (2 hours, optimizes 50 functions)
- Is it complex? → Keep it interpreted (fallback)
All three are valid. All three are Koru code. Mix freely.
The Boundless Emulation Future
We’re not just building better emulators. We’re building a world where any executable artifact can be transparently translated, optimized, and run on any hardware.
Preservation meets performance. Correctness meets speed. All through progressive optimization.
This is what Koru makes possible.
Want to learn more? Check out Call-Site GeoHashing for the technical details, or Building an Emulator in Koru for a step-by-step tutorial.