Rust's New Tail-Call Optimization: How the Nightly 'become' Keyword Outperforms Hand-Written Assembly
Tail-Call Optimization in Rust Reaches New Heights
Developer Matt Keeter has published a technical deep-dive demonstrating that Rust's nightly-only 'become' keyword for guaranteed tail calls can produce a VM interpreter that outperforms both idiomatic Rust implementations and hand-written ARM64 assembly code. The project implements an emulator for the Uxn stack-machine CPU used in the Hundred Rabbits creative computing ecosystem.
The Performance Problem with Interpreters
Traditional interpreter loops suffer from two key bottlenecks: unpredictable dispatch branches (selecting among 256 opcodes) and memory-bound state access. Keeter's journey to optimize the Uxn emulator has gone through multiple stages:
- Original Rust implementation ('Raven'): clean but limited by compiler optimization constraints
- Hand-written ARM64 assembly: 40-50% faster using token threading techniques
- Hand-written x86-64 assembly: approximately 2× faster, but introduced memory safety bugs
- Tail-call Rust implementation: beats the ARM64 assembly and comes close to the x86-64 assembly, while keeping Rust's safety guarantees
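To make the two bottlenecks concrete, here is a minimal sketch of a conventional loop-based interpreter. It assumes a hypothetical four-opcode toy stack machine (the opcode names, the Vm struct, and run_loop are invented for illustration and are not Uxn's instruction set or Raven's code). Every instruction passes through the same match, and the VM state is grouped in a struct rather than carried in function arguments.

```rust
// Hypothetical toy stack machine for illustration only (not Uxn, not Raven).
const HALT: u8 = 0;
const PUSH: u8 = 1; // followed by one immediate byte
const ADD: u8 = 2;
const MUL: u8 = 3;

// VM state grouped in a struct; in a larger emulator this is typically
// reached through a pointer, so handlers read and write it via memory.
struct Vm {
    stack: [u8; 256],
    sp: usize, // index of the next free stack slot
    pc: usize, // index of the next byte to fetch
}

// Runs `program` until HALT and returns the value on top of the stack.
fn run_loop(program: &[u8]) -> u8 {
    let mut vm = Vm { stack: [0; 256], sp: 0, pc: 0 };
    loop {
        let op = program[vm.pc];
        vm.pc += 1;
        // The single, shared dispatch point: every opcode funnels through
        // this one branch, which is what makes it hard to predict.
        match op {
            PUSH => {
                vm.stack[vm.sp] = program[vm.pc];
                vm.pc += 1;
                vm.sp += 1;
            }
            ADD => {
                vm.stack[vm.sp - 2] = vm.stack[vm.sp - 2].wrapping_add(vm.stack[vm.sp - 1]);
                vm.sp -= 1;
            }
            MUL => {
                vm.stack[vm.sp - 2] = vm.stack[vm.sp - 2].wrapping_mul(vm.stack[vm.sp - 1]);
                vm.sp -= 1;
            }
            HALT => return vm.stack[vm.sp - 1],
            _ => panic!("unknown opcode {}", op),
        }
    }
}
```

For example, run_loop(&[PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]) evaluates (2 + 3) * 4 and returns 20.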
How the Tail-Call Approach Works
The key insight is using Rust's 'become' keyword (available on nightly for about seven months, since PR #144232; it is not stabilized) to implement threaded code at the language level:
- VM state is stored in function parameters instead of memory
- Each opcode handler ends with a tail-call to the next handler
- Dispatch is distributed across every opcode, improving branch prediction
- The compiler can optimize register allocation across the entire dispatch chain
This approach achieves the same effect as assembly token threading, where each handler ends by jumping straight to the next one instead of returning to a central loop, but with Rust's safety guarantees and without maintaining ~2000 lines of unsafe assembly.
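The structure described above can be sketched on a hypothetical four-opcode toy stack machine (the opcodes, Handler, TABLE, and run_threaded are invented for illustration; this is not Keeter's code). So that it compiles on stable Rust, the sketch uses ordinary calls in tail position. On nightly with #![feature(explicit_tail_calls)], each of those final calls would be written with 'become' so that the handoff is guaranteed not to grow the stack. Without that guarantee, this stable version is only suitable for short programs.

```rust
// Hypothetical toy stack machine for illustration only (not Uxn).
const HALT: u8 = 0;
const PUSH: u8 = 1; // followed by one immediate byte
const ADD: u8 = 2;
const MUL: u8 = 3;

// Every handler shares one signature, so any handler can hand off to any
// other. VM state (pc, sp) travels in the arguments instead of a struct.
type Handler = fn(&[u8], usize, &mut [u8; 256], usize) -> u8;

// Dispatch table indexed by opcode. This toy has 4 entries; an opcode
// outside the table panics with an out-of-bounds index.
static TABLE: [Handler; 4] = [op_halt, op_push, op_add, op_mul];

// In every handler, `pc` points at the byte right after the opcode.
fn op_halt(_program: &[u8], _pc: usize, stack: &mut [u8; 256], sp: usize) -> u8 {
    stack[sp - 1] // top of stack is the result
}

fn op_push(program: &[u8], pc: usize, stack: &mut [u8; 256], sp: usize) -> u8 {
    stack[sp] = program[pc]; // immediate operand
    let pc = pc + 1;
    // Each handler does its own dispatch as its last action.
    // Nightly: `become TABLE[...](...)` instead of a plain call.
    TABLE[program[pc] as usize](program, pc + 1, stack, sp + 1)
}

fn op_add(program: &[u8], pc: usize, stack: &mut [u8; 256], sp: usize) -> u8 {
    stack[sp - 2] = stack[sp - 2].wrapping_add(stack[sp - 1]);
    TABLE[program[pc] as usize](program, pc + 1, stack, sp - 1)
}

fn op_mul(program: &[u8], pc: usize, stack: &mut [u8; 256], sp: usize) -> u8 {
    stack[sp - 2] = stack[sp - 2].wrapping_mul(stack[sp - 1]);
    TABLE[program[pc] as usize](program, pc + 1, stack, sp - 1)
}

// Entry point: fetch the first opcode and jump into the handler chain.
fn run_threaded(program: &[u8]) -> u8 {
    let mut stack = [0u8; 256];
    TABLE[program[0] as usize](program, 1, &mut stack, 0)
}
```

For example, run_threaded(&[PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]) evaluates (2 + 3) * 4 and returns 20. Because the table lookup is repeated at the end of every handler, each opcode gets its own indirect branch, which is the property the list above credits with improving branch prediction.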
Benchmark Results
The tail-call Rust backend serves as a viable substitute for the x86 assembly backend with only minor performance penalties. It significantly outperforms the original loop-based Rust implementation, validating that modern compilers can compete with hand-optimized assembly when given the right abstractions.
Broader Implications for Language Design
This work is part of a broader trend in language design toward first-class tail-call optimization support. As noted by Keeter, 'tailcall-based techniques have been all the rage recently,' with multiple language communities exploring similar approaches.
The project demonstrates that:
- Safety and performance need not be in tension
- Language-level optimizations can eliminate the need for unsafe assembly
- Modern compilers are increasingly capable of matching hand-written low-level code
Controversy Note
Keeter's previous work, which used Claude Code to assist with the x86-64 assembly port, proved controversial on Hacker News. He explicitly states that all tail-call code in this latest work is human-written, and that the blog post itself meets his personal standards regarding AI-generated content.