Performance
Execution tiers
zwasm uses tiered execution:
- Interpreter: All functions start as register IR, executed by a dispatch loop. Fast startup, no compilation overhead.
- JIT (ARM64/x86_64): Hot functions are compiled to native code when call count or back-edge count exceeds a threshold.
When JIT kicks in
- Call / back-edge threshold: HOT_THRESHOLD = 3 (lowered from 10 in W38). A function is promoted to JIT after 3 calls or 3 back-edges in a hot loop.
- Back-edge counting: hot loops are detected without waiting for the call threshold; loop iterations count individually.
Once JIT-compiled, all subsequent calls to that function execute native machine code directly, bypassing the register-IR interpreter.
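
The promotion rule can be pictured with a small counter model. This is a minimal sketch, not zwasm's actual code: only the threshold value 3 comes from the text above. The `FuncState` layout, the function names, the use of two separate counters, and the `>=` comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass

HOT_THRESHOLD = 3  # value quoted above; everything else is illustrative


@dataclass
class FuncState:
    """Per-function profiling state (hypothetical layout)."""
    calls: int = 0
    back_edges: int = 0
    jitted: bool = False


def _maybe_promote(state: FuncState) -> None:
    # Assumption: either counter reaching the threshold triggers promotion.
    if state.calls >= HOT_THRESHOLD or state.back_edges >= HOT_THRESHOLD:
        state.jitted = True  # stand-in for "compile to native code"


def record_call(state: FuncState) -> None:
    """Count one call to a function that is still interpreted."""
    if not state.jitted:
        state.calls += 1
        _maybe_promote(state)


def record_back_edge(state: FuncState) -> None:
    """Count one loop iteration (a taken backward branch)."""
    if not state.jitted:
        state.back_edges += 1
        _maybe_promote(state)
```

In this model a function called once but looping three times is promoted during its first call, which is the point of counting back-edges individually.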
Binary size and memory
| Metric | Value |
|---|---|
| Binary size (ReleaseSafe, stripped) | ~1.20 MB Mac / ~1.56 MB Linux |
| CI ceiling (stripped) | 1.60 MB |
| Runtime memory (fib benchmark) | ~3.5 MB RSS |
| wasmtime binary for comparison | ~56 MB |
zwasm is roughly 35× smaller than wasmtime on Linux and 47× smaller on Mac.
Benchmark results
Representative benchmarks comparing zwasm against wasmtime 41.0.1, Bun 1.3.8, and Node v24.13.0 on Apple M4 Pro.
The majority of the 29 benchmarks match or beat wasmtime; a few compute-heavy long-running ones (e.g. st_fib2) still trail Cranelift AOT.
| Benchmark | zwasm | wasmtime | Bun | Node |
|---|---|---|---|---|
| nqueens(8) | 2 ms | 5 ms | 14 ms | 23 ms |
| nbody(1M) | 22 ms | 22 ms | 32 ms | 36 ms |
| gcd(12K,67K) | 2 ms | 5 ms | 14 ms | 23 ms |
| tak(24,16,8) | 5 ms | 9 ms | 17 ms | 29 ms |
| sieve(1M) | 5 ms | 7 ms | 17 ms | 29 ms |
| fib(35) | 46 ms | 51 ms | 36 ms | 52 ms |
| st_fib2 | 900 ms | 674 ms | 353 ms | 389 ms |
zwasm uses 3-4x less memory than wasmtime and 8-10x less than Bun/Node.
Full results (29 benchmarks): bench/runtime_comparison.yaml
SIMD performance
SIMD (v128) operations are JIT-compiled to native NEON (ARM64, 253/256 opcodes) and SSE (x86_64, 244/256 opcodes). v128 values now use contiguous register storage (W37) with a Q-cache over Q16–Q31 / XMM6–XMM15 (W43, W44) to keep hot vectors resident.
| Benchmark | zwasm scalar | zwasm SIMD | wasmtime SIMD | scalar→SIMD |
|---|---|---|---|---|
| image_blend (128x128) | 73 ms | 16 ms | 12 ms | 4.7× |
| matrix_mul (16x16) | 10 ms | 6 ms | 8 ms | 1.6× |
| byte_search (64 KB) | 52 ms | 43 ms | 5 ms | 1.2× |
| dot_product (4096) | 142 ms | 190 ms | 15 ms | 0.75× |
matrix_mul beats wasmtime; image_blend is within 1.4×. byte_search and
dot_product still trail wasmtime — the remaining gaps are dominated by
patterns like i16x8.replace_lane and v128.load-heavy inner loops in
compiler-generated SIMD code (C -msimd128). Loop-header eviction in the
Q-cache (tracked as W45) is the next lever.
Full data: bench/simd_comparison.yaml
Benchmark methodology
All measurements use hyperfine with ReleaseSafe builds:
```bash
# Quick check (1 run, no warmup)
bash bench/run_bench.sh --quick

# Full measurement (5 runs, 3 warmup)
bash bench/run_bench.sh

# Record to history
bash bench/record.sh --id="X" --reason="description"
```
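
For readers who want to reproduce a single measurement without the wrapper scripts, the sketch below builds an equivalent hyperfine command line and reads hyperfine's JSON export. It is an illustration under assumptions: what `run_bench.sh` actually passes to hyperfine is not shown here, and the helper names are hypothetical. The hyperfine options used (`--warmup`, `--runs`, `--export-json`) and the `results[].mean` field (in seconds) are standard hyperfine features.

```python
import json


def hyperfine_cmd(command, runs=5, warmup=3, export_json=None):
    """Build a hyperfine argv list for one benchmarked command."""
    argv = ["hyperfine", "--warmup", str(warmup), "--runs", str(runs)]
    if export_json is not None:
        argv += ["--export-json", export_json]
    argv.append(command)
    return argv


def mean_ms(hyperfine_json_text):
    """Map each command in a hyperfine JSON export to its mean time in ms."""
    data = json.loads(hyperfine_json_text)
    # hyperfine reports times in seconds; convert to milliseconds.
    return {r["command"]: r["mean"] * 1000.0 for r in data["results"]}
```

The resulting list can be passed to `subprocess.run(..., check=True)`; the benchmarked command string itself (for example a zwasm invocation on a `.wasm` file) is whatever you would type in the shell.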
Benchmark layers
| Layer | Count | Description |
|---|---|---|
| WAT micro | 5 | Hand-written: fib, tak, sieve, nbody, nqueens |
| TinyGo | 11 | TinyGo compiler output: same algorithms + string ops |
| Shootout | 5 | Sightglass shootout suite (WASI) |
| Real-world | 6 | Rust, C, C++ compiled to Wasm (matrix, math, string, sort) |
| GC | 2 | GC proposal: struct allocation, tree traversal |
| SIMD | 10 | WAT microbench (4) + C -msimd128 real-world (5), scalar/SIMD |
CI regression detection
PRs are automatically checked for performance regressions:
- 12 representative benchmarks (6 uncached + 6 cached) run on both base and PR branch
- Fails if any benchmark regresses by more than 20%
- Same runner ensures fair comparison
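
The 20% rule amounts to a per-benchmark relative comparison. The following is a minimal sketch of that check, assuming timings are available as name-to-milliseconds mappings; the function name and data shape are hypothetical and do not describe the CI script itself.

```python
def find_regressions(base_ms, pr_ms, max_regression=0.20):
    """Return {name: relative slowdown} for benchmarks slower than the limit."""
    regressions = {}
    for name, base in base_ms.items():
        pr = pr_ms[name]
        slowdown = (pr - base) / base  # 0.30 means 30% slower than base
        if slowdown > max_regression:  # strictly "more than 20%"
            regressions[name] = slowdown
    return regressions
```

A CI job built on this would fail whenever the returned mapping is non-empty. Running base and PR on the same runner matters because the comparison is relative: machine-to-machine variance would otherwise show up as a false regression.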
Performance tips
- ReleaseSafe: Always use for production. Debug is 5-10x slower.
- Hot functions: Functions called frequently will be JIT-compiled automatically.
- Fuel limit: --fuel adds overhead per instruction. Only use for untrusted code.
- Memory: Wasm modules with linear memory allocate guard pages. Initial RSS is ~3.5 MB regardless of module size.
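
To see why a fuel limit costs something on every instruction, consider a toy interpreter with an optional budget. This is a conceptual sketch only: the two-opcode stack machine and the `OutOfFuel` exception are invented for illustration and say nothing about how zwasm implements metering internally.

```python
class OutOfFuel(Exception):
    """Raised when the instruction budget is exhausted."""


def run(program, fuel=None):
    """Execute a toy stack program; `fuel` caps the number of instructions."""
    stack = []
    for op, arg in program:
        if fuel is not None:
            # The extra work metering adds: one check and one decrement
            # for every executed instruction.
            if fuel == 0:
                raise OutOfFuel()
            fuel -= 1
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        else:
            raise ValueError(f"unknown op: {op}")
    return stack
```

With `fuel=None` the loop body skips the bookkeeping entirely, which mirrors the advice above: pay for metering only when running code you do not trust to terminate.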