Hypura: Run LLMs Larger Than Your Mac's Memory Using Storage-Aware Scheduling
Breaking the Memory Barrier on Apple Silicon
Hypura is a new open-source tool that enables running LLMs that exceed a Mac's physical memory by intelligently placing model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and hardware capabilities.
The Problem
Consumer Apple Silicon Macs have fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot natively load a 40 GB model: macOS swap-thrashes until the kernel kills the process. Standard llama.cpp simply fails on models that exceed available memory.
How Hypura Works
Hypura reads the GGUF model file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to the optimal tier:
GPU (Metal) — Attention layers, norms, embeddings. Fastest access, limited by recommendedMaxWorkingSetSize.
RAM — Overflow for frequently accessed tensors not on GPU.
NVMe — Dense FFN weights (~60% of model size) stream from NVMe through a dynamically sized pool buffer.
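The tier assignment above can be sketched as a greedy placement pass: sort tensors by how often they are accessed, then give the hottest ones the fastest tier that still has room. This is an illustrative sketch, not Hypura's actual solver; the tensor names, heat scores, and capacities are invented for the example.

```python
def place_tensors(tensors, capacities):
    """tensors: list of (name, size_bytes, heat); capacities: {tier: bytes}.
    Returns {name: tier}. Tiers are tried fastest-first."""
    tiers = ["gpu", "ram", "nvme"]
    free = dict(capacities)
    placement = {}
    # Hotter (more frequently accessed) tensors get first claim on fast tiers.
    for name, size, heat in sorted(tensors, key=lambda t: -t[2]):
        for tier in tiers:
            if free[tier] >= size:
                placement[name] = tier
                free[tier] -= size
                break
        else:
            raise MemoryError(f"no tier can hold {name}")
    return placement

# Illustrative sizes (bytes) and heat scores; not real model data.
tensors = [
    ("attn.q", 2, 0.9), ("embed", 4, 0.8),
    ("ffn.w1", 8, 0.3), ("ffn.w2", 8, 0.3),
]
caps = {"gpu": 6, "ram": 4, "nvme": 100}
print(place_tensors(tensors, caps))
# → attention and embeddings land on GPU, dense FFN weights spill to NVMe
```

A real placement would also weigh tier bandwidth and per-tensor access order, but the fastest-tier-first greedy pass captures the core idea.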
Key Innovations
- MoE expert optimization: In Mixtral-style MoE models, only 2 of 8 experts fire per token. Router interception identifies the selected experts and loads only the needed strides from NVMe (75% I/O reduction)
- Neuron cache: Tracks loaded expert slices across tokens, achieving 99.5% hit rate from temporal locality
- Co-activation tracking: Predicts which experts will fire next for speculative prefetch
- Models that fit in memory run at full Metal GPU speed with zero overhead
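The router-interception and neuron-cache ideas fit together naturally: read the router's top-k choice, then serve each expert slice from an LRU cache so repeated selections across tokens never touch NVMe again. The sketch below is hypothetical (the class names, `load_fn` hook, and cache capacity are invented for illustration), but it shows how temporal locality produces the high hit rates the article describes.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weight slices keyed by (layer, expert)."""
    def __init__(self, capacity, load_fn):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.load_fn = load_fn  # would read the slice from NVMe in reality
        self.hits = self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[key] = self.load_fn(layer, expert)
        return self.cache[key]

def top_k_experts(router_logits, k=2):
    """Indices of the k highest-scoring experts, as the router selects them."""
    return sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]

# Simulate 10 tokens whose router keeps picking the same two experts.
cache = ExpertCache(capacity=16, load_fn=lambda l, e: f"slice[{l}][{e}]")
for logits in [[0.9, 0.1, 0.8, 0, 0, 0, 0, 0]] * 10:
    for e in top_k_experts(logits):
        cache.get(layer=0, expert=e)
hit_rate = cache.hits / (cache.hits + cache.misses)
print(f"hit rate: {hit_rate:.0%}")  # only the first token misses
```

With a steady expert distribution, only the first selection of each slice pays NVMe latency; everything after is a RAM hit, which is where the reported 99.5% hit rate would come from.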
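Co-activation tracking can be sketched as a co-occurrence counter: record which expert pairs fire together, and when one expert is selected, speculatively prefetch its most frequent partners. This is an invented illustration of the idea, not Hypura's implementation.

```python
from collections import Counter
from itertools import combinations

class CoActivationTracker:
    """Counts how often expert pairs fire together across tokens."""
    def __init__(self):
        self.pair_counts = Counter()

    def observe(self, experts):
        """Record one token's set of selected experts."""
        for a, b in combinations(sorted(experts), 2):
            self.pair_counts[(a, b)] += 1

    def predict(self, expert, top=1):
        """Experts that most often co-fire with `expert` (prefetch candidates)."""
        scores = Counter()
        for (a, b), n in self.pair_counts.items():
            if a == expert:
                scores[b] += n
            elif b == expert:
                scores[a] += n
        return [e for e, _ in scores.most_common(top)]

tracker = CoActivationTracker()
for selected in [(0, 2), (0, 2), (0, 5), (1, 2)]:
    tracker.observe(selected)
print(tracker.predict(0))  # → [2]: expert 2 co-fired with 0 most often
```

When the router picks expert 0, the scheduler can start streaming expert 2's slice from NVMe before the router for that layer has even run, hiding I/O latency behind compute.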
Benchmarks
- 31 GB Mixtral 8x7B on 32 GB Mac mini → 2.2 tok/s
- 40 GB Llama 70B on 32 GB Mac → 0.3 tok/s
- Both would crash vanilla llama.cpp
Significance
Hypura lowers the barrier to large-model inference by running models that exceed physical memory on consumer hardware. For researchers, developers, and enthusiasts who can't afford $10,000+ GPU servers, this opens new possibilities for local AI experimentation.