
# Performance

## Execution tiers

zwasm uses tiered execution:

  1. Interpreter: All functions start as register IR, executed by a dispatch loop. Fast startup, no compilation overhead.
  2. JIT (ARM64/x86_64): Hot functions are compiled to native code when call count or back-edge count exceeds a threshold.

### When JIT kicks in

  • Call threshold: After ~8 calls to the same function
  • Back-edge counting: Hot loops trigger JIT faster (loop iterations count toward the threshold)
  • Adaptive: The threshold adjusts based on function characteristics

Once JIT-compiled, all subsequent calls to that function execute native machine code directly, bypassing the interpreter.
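The tier-up policy above can be sketched in a few lines. This is an illustrative model only, written in Python rather than Zig: the threshold of ~8 comes from the text, but the class and method names are invented for the example and are not zwasm's internals.

```python
# Illustrative sketch of call/back-edge tier-up counting (not zwasm's real code).
JIT_THRESHOLD = 8  # ~8 calls, per the description above

class FunctionProfile:
    def __init__(self, name):
        self.name = name
        self.hotness = 0          # calls and back-edges count together
        self.jit_compiled = False

    def on_call(self):
        self.hotness += 1
        self._maybe_tier_up()

    def on_back_edge(self):
        # Loop iterations also count toward the threshold,
        # so hot loops trigger JIT faster than cold call sites.
        self.hotness += 1
        self._maybe_tier_up()

    def _maybe_tier_up(self):
        if not self.jit_compiled and self.hotness >= JIT_THRESHOLD:
            self.jit_compiled = True  # subsequent calls run native code

f = FunctionProfile("fib")
for _ in range(3):
    f.on_call()
for _ in range(5):
    f.on_back_edge()
print(f.jit_compiled)  # True: 3 calls + 5 back-edges reach the threshold
```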

## Binary size and memory

| Metric | Value |
|---|---|
| Binary size (ReleaseSafe) | ~1.2 MB |
| Runtime memory (fib benchmark) | ~3.5 MB RSS |
| wasmtime binary (for comparison) | 56.3 MB |

zwasm is ~40x smaller than wasmtime.

## Benchmark results

Representative benchmarks comparing zwasm against wasmtime 41.0.1, Bun 1.3.8, and Node v24.13.0 on an Apple M4 Pro. zwasm matches or beats wasmtime on 16 of 29 benchmarks and stays within 1.5x on 25 of 29.

| Benchmark | zwasm | wasmtime | Bun | Node |
|---|---|---|---|---|
| nqueens(8) | 2 ms | 5 ms | 14 ms | 23 ms |
| nbody(1M) | 22 ms | 22 ms | 32 ms | 36 ms |
| gcd(12K,67K) | 2 ms | 5 ms | 14 ms | 23 ms |
| tak(24,16,8) | 5 ms | 9 ms | 17 ms | 29 ms |
| sieve(1M) | 5 ms | 7 ms | 17 ms | 29 ms |
| fib(35) | 46 ms | 51 ms | 36 ms | 52 ms |
| st_fib2 | 900 ms | 674 ms | 353 ms | 389 ms |
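As a quick cross-check, the per-benchmark ratio against wasmtime can be computed straight from a few rows of the table (times copied from above; a ratio above 1.0 means zwasm is faster):

```python
# Selected rows from the table above, as (zwasm_ms, wasmtime_ms).
times = {
    "nqueens(8)": (2, 5),
    "nbody(1M)":  (22, 22),
    "fib(35)":    (46, 51),
    "st_fib2":    (900, 674),
}
for name, (zwasm_ms, wasmtime_ms) in times.items():
    # wasmtime/zwasm: how many times faster zwasm is on this benchmark
    print(f"{name}: {wasmtime_ms / zwasm_ms:.2f}x")
```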

zwasm uses 3-4x less memory than wasmtime and 8-10x less than Bun/Node.

Full results (29 benchmarks): `bench/runtime_comparison.yaml`

## SIMD performance

SIMD (v128) operations are JIT-compiled to native NEON (ARM64, 253/256 opcodes) and SSE (x86_64, 244/256 opcodes). v128 values are stored as split 64-bit halves in the register file.

| Benchmark | zwasm scalar | zwasm SIMD | wasmtime SIMD | SIMD speedup |
|---|---|---|---|---|
| image_blend (128x128) | 73 ms | 16 ms | 12 ms | 4.7x |
| matrix_mul (16x16) | 10 ms | 6 ms | 8 ms | 1.6x |
| byte_search (64KB) | 52 ms | 43 ms | 5 ms | 1.2x |
| dot_product (4096) | 142 ms | 190 ms | 15 ms | 0.75x |

matrix_mul beats wasmtime, and image_blend comes within 1.3x. dot_product is slower due to a v128.load-heavy inner loop and the overhead of split storage.

Compiler-generated SIMD code (C -msimd128) shows larger gaps due to patterns like i16x8.replace_lane that are expensive with split v128 storage. Future work: contiguous v128 register storage (W37) to eliminate this overhead.
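A toy model makes the split-storage cost concrete. Here a v128 value is held as two 64-bit halves, as described above, and an i16x8.replace_lane must select a half, mask out the old lane, and merge the new one. The helper below is a hypothetical sketch, not zwasm's code; with contiguous storage this would be a single native lane insert.

```python
# Illustrative model of split v128 storage: one 128-bit value kept as two
# 64-bit register slots (lo = i16x8 lanes 0-3, hi = lanes 4-7).
MASK64 = (1 << 64) - 1

def i16x8_replace_lane(lo, hi, lane, value):
    """Replace one 16-bit lane of a split v128: pick the right half,
    clear the old lane bits, then OR in the new value."""
    value &= 0xFFFF
    shift = (lane % 4) * 16
    if lane < 4:
        lo = (lo & ~(0xFFFF << shift) & MASK64) | (value << shift)
    else:
        hi = (hi & ~(0xFFFF << shift) & MASK64) | (value << shift)
    return lo, hi

lo, hi = i16x8_replace_lane(0, 0, 5, 0xABCD)
print(hex(lo), hex(hi))  # lane 5 lands in the high half
```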

Full data: `bench/simd_comparison.yaml`

## Benchmark methodology

All measurements use hyperfine with ReleaseSafe builds:

```sh
# Quick check (1 run, no warmup)
bash bench/run_bench.sh --quick

# Full measurement (3 runs, 1 warmup)
bash bench/run_bench.sh

# Record to history
bash bench/record.sh --id="X" --reason="description"
```

### Benchmark layers

| Layer | Count | Description |
|---|---|---|
| WAT micro | 5 | Hand-written: fib, tak, sieve, nbody, nqueens |
| TinyGo | 11 | TinyGo compiler output: same algorithms + string ops |
| Shootout | 5 | Sightglass shootout suite (WASI) |
| Real-world | 6 | Rust, C, C++ compiled to Wasm (matrix, math, string, sort) |
| GC | 2 | GC proposal: struct allocation, tree traversal |
| SIMD | 10 | WAT microbench (4) + C -msimd128 real-world (5), scalar/SIMD |

## CI regression detection

PRs are automatically checked for performance regressions:

  • 6 representative benchmarks run on both base and PR branch
  • Fails if any benchmark regresses by more than 20%
  • Same runner ensures fair comparison
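The 20% gate above can be sketched as a simple comparison. The threshold is taken from the text; the function name and sample data are illustrative, not the actual CI script.

```python
# Minimal sketch of a 20% regression gate over base-vs-PR timings.
REGRESSION_THRESHOLD = 0.20

def is_regression(base_ms, pr_ms, threshold=REGRESSION_THRESHOLD):
    # A benchmark regresses when the PR time exceeds base by MORE than 20%.
    return pr_ms > base_ms * (1.0 + threshold)

# Hypothetical (base_ms, pr_ms) pairs for two benchmarks.
results = {"fib(35)": (46, 58), "sieve(1M)": (5, 5.5)}
failed = [name for name, (base, pr) in results.items() if is_regression(base, pr)]
print(failed)  # fib(35): 58 > 46 * 1.2 = 55.2, so the gate fails
```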

## Performance tips

  • ReleaseSafe: Always use for production. Debug is 5-10x slower.
  • Hot functions: Functions called frequently will be JIT-compiled automatically.
  • Fuel limit: --fuel adds overhead per instruction. Only use for untrusted code.
  • Memory: Wasm modules with linear memory allocate guard pages. Initial RSS is ~3.5 MB regardless of module size.