Rust's New Tail-Call Optimization: How the Nightly 'become' Keyword Outperforms Hand-Written Assembly


2026-04-05T23:47:29.196Z

Traditional interpreter loops suffer from two key bottlenecks: unpredictable dispatch branches (selecting among 256 opcodes) and memory-bound state access. Keeter's journey to optimize the Uxn emul...

Tail-Call Optimization in Rust Reaches New Heights

Developer Matt Keeter has published a fascinating technical deep-dive showing that Rust's new nightly 'become' keyword, which guarantees tail calls, can produce a VM interpreter that outperforms both an idiomatic Rust implementation and hand-written ARM64 assembly. The project implements an emulator for the Uxn stack-machine CPU used in the Hundred Rabbits creative computing ecosystem.

The Performance Problem with Interpreters

Keeter compares four interpreter backends:

Original Rust implementation ('Raven'): clean, but its loop-based dispatch limits how well the compiler can optimize it
Hand-written ARM64 assembly: 40-50% faster than the original Rust, using token threading
Hand-written x86-64 assembly: roughly 2× faster than the original Rust, but it introduced memory-safety bugs
Tail-call Rust implementation: roughly matches assembly performance while keeping Rust's safety guarantees
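
For contrast, here is a minimal sketch of the conventional loop-and-match interpreter shape that the original backend is described as using. The opcode values, the `Vm` struct, and its methods are hypothetical and belong to a toy stack machine, not to the real Uxn instruction set or Raven's code. The sketch shows both bottlenecks: every opcode funnels through one `match`, and all VM state is read and written through `self`.

```rust
// Hypothetical opcodes for a toy stack machine (NOT the real Uxn encoding).
const HALT: u8 = 0x00;
const PUSH: u8 = 0x01; // followed by a one-byte literal
const ADD: u8 = 0x02;
const MUL: u8 = 0x03;

/// VM state lives in a struct, so handlers access it through memory.
struct Vm {
    stack: [u8; 256],
    sp: usize, // index of the next free stack slot
    pc: usize, // index of the next program byte
}

impl Vm {
    fn new() -> Self {
        Vm { stack: [0; 256], sp: 0, pc: 0 }
    }

    fn push(&mut self, v: u8) {
        self.stack[self.sp] = v;
        self.sp += 1;
    }

    fn pop(&mut self) -> u8 {
        self.sp -= 1;
        self.stack[self.sp]
    }

    /// Runs until HALT and returns the value on top of the stack.
    fn run(&mut self, program: &[u8]) -> u8 {
        loop {
            let op = program[self.pc];
            self.pc += 1;
            // Every opcode returns to this single `match` (often compiled to
            // a jump table), so all dispatch decisions share one branch site.
            match op {
                HALT => return self.stack[self.sp - 1],
                PUSH => {
                    let v = program[self.pc];
                    self.pc += 1;
                    self.push(v);
                }
                ADD => {
                    let b = self.pop();
                    let a = self.pop();
                    self.push(a.wrapping_add(b));
                }
                MUL => {
                    let b = self.pop();
                    let a = self.pop();
                    self.push(a.wrapping_mul(b));
                }
                _ => panic!("unknown opcode {op:#04x}"),
            }
        }
    }
}
```

For example, `Vm::new().run(&[PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])` computes (2 + 3) × 4 and returns 20.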

How the Tail-Call Approach Works

The key insight is to use Rust's 'become' keyword (available on nightly since it landed roughly seven months ago via PR #144232) to implement threaded code at the language level:

VM state is passed in function parameters, where it can stay in registers, instead of being loaded from memory
Each opcode handler ends with a tail call to the next handler
Dispatch is replicated at the end of every opcode handler, giving the branch predictor a separate branch per opcode
The compiler can optimize register allocation across the entire dispatch chain

This approach achieves the same effect as token threading in assembly, where each handler ends by looking up the next opcode's handler in a table and jumping straight to it instead of returning to a central loop, but with Rust's safety guarantees and without maintaining ~2000 lines of unsafe assembly.
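
A minimal sketch of the tail-call style on a toy stack machine follows. The opcodes, `Handler`, `TABLE`, `dispatch`, and the `op_*` handlers are hypothetical names invented for illustration; this is not Keeter's code or the Uxn instruction set. It is written to compile on stable Rust, so each tail call is an ordinary call, and comments mark where nightly's 'become' would go (assuming the `explicit_tail_calls` feature gate). With 'become', the compiler must reuse the caller's stack frame. On stable, that reuse is only an optimization the compiler may or may not apply, so a long-running program could overflow the stack.

```rust
// Hypothetical opcodes for a toy stack machine (NOT the real Uxn encoding).
// Each value doubles as an index into TABLE below.
const HALT: u8 = 0x00;
const PUSH: u8 = 0x01; // followed by a one-byte literal
const ADD: u8 = 0x02;
const MUL: u8 = 0x03;

/// Shared handler signature. Scalar VM state (`pc`, `sp`) travels as
/// arguments instead of living in a struct, so it can stay in registers.
type Handler = fn(&[u8], &mut [u8; 256], usize, usize) -> u8;

/// Token threading: the opcode byte indexes this table of handlers.
static TABLE: [Handler; 4] = [op_halt, op_push, op_add, op_mul];

/// Fetch the opcode at `pc` and tail-call its handler. Because this is
/// inlined into every handler, each opcode gets its own dispatch branch.
#[inline(always)]
fn dispatch(prog: &[u8], stack: &mut [u8; 256], pc: usize, sp: usize) -> u8 {
    let handler = TABLE[prog[pc] as usize]; // panics on an unknown opcode
    // On nightly with explicit tail calls: `become handler(prog, stack, pc + 1, sp)`
    handler(prog, stack, pc + 1, sp)
}

fn op_halt(_prog: &[u8], stack: &mut [u8; 256], _pc: usize, sp: usize) -> u8 {
    stack[sp - 1] // result is the top of the stack; no further dispatch
}

fn op_push(prog: &[u8], stack: &mut [u8; 256], pc: usize, sp: usize) -> u8 {
    stack[sp] = prog[pc]; // the literal byte follows the opcode
    dispatch(prog, stack, pc + 1, sp + 1) // tail call ('become' on nightly)
}

fn op_add(prog: &[u8], stack: &mut [u8; 256], pc: usize, sp: usize) -> u8 {
    stack[sp - 2] = stack[sp - 2].wrapping_add(stack[sp - 1]);
    dispatch(prog, stack, pc, sp - 1) // tail call ('become' on nightly)
}

fn op_mul(prog: &[u8], stack: &mut [u8; 256], pc: usize, sp: usize) -> u8 {
    stack[sp - 2] = stack[sp - 2].wrapping_mul(stack[sp - 1]);
    dispatch(prog, stack, pc, sp - 1) // tail call ('become' on nightly)
}

/// Entry point: start at byte 0 with an empty stack.
fn run(prog: &[u8]) -> u8 {
    let mut stack = [0u8; 256];
    dispatch(prog, &mut stack, 0, 0)
}
```

Here `run(&[PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])` returns 20. There is no central loop: every handler ends with its own copy of the table lookup and call. With a guaranteed tail call, that call becomes a plain jump, which is the token-threading shape described above.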

Benchmark Results

The tail-call Rust backend serves as a viable substitute for the x86 assembly backend with only minor performance penalties. It significantly outperforms the original loop-based Rust implementation, validating that modern compilers can compete with hand-optimized assembly when given the right abstractions.

Broader Implications for Language Design

This work is part of a broader trend in language design toward first-class tail-call optimization support. As noted by Keeter, 'tailcall-based techniques have been all the rage recently,' with multiple language communities exploring similar approaches.

The project demonstrates that:

Safety and performance need not be in tension
Language-level optimizations can eliminate the need for unsafe assembly
Modern compilers are increasingly capable of matching hand-written low-level code

Controversy Note

Keeter's previous work, in which he used Claude Code to help with the x86-64 assembly port, proved controversial on Hacker News. He explicitly states that all tail-call code in this latest work is human-written and that the blog post itself meets his personal standards for AI-generated content.
