Hypura: Run LLMs Larger Than Your Mac's Memory Using Storage-Aware Scheduling

2026-03-25 · 2 min read

Breaking the Memory Barrier on Apple Silicon

Hypura is a new open-source tool that enables running LLMs that exceed a Mac's physical memory by intelligently placing model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and hardware capabilities.

The Problem

Consumer Apple Silicon Macs have fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot natively load a 40 GB model — the OS swap-thrashes until the OOM killer intervenes. Standard llama.cpp simply crashes on models that exceed available memory.

How Hypura Works

Hypura reads the GGUF model file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization problem that assigns each tensor to the most suitable tier:

GPU (Metal) — Attention layers, norms, embeddings. Fastest access, limited by recommendedMaxWorkingSetSize.

RAM — Overflow for frequently accessed tensors not on GPU.

NVMe — Dense FFN weights (~60% of model size) stream from NVMe through a dynamically-sized pool buffer.
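The tier assignment above can be pictured as a greedy bin-packing pass: sort tensors by how often they are touched, then fill the fastest tier that still has room. This is a minimal sketch, not Hypura's actual solver; the `Tensor` fields, tier capacities, and hotness scores below are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int       # bytes
    hotness: float  # relative access frequency; higher = touched more often

# Illustrative tier capacities in priority order (fastest first). Real values
# would come from hardware profiling, e.g. Metal's recommendedMaxWorkingSetSize
# for the GPU tier. NVMe is treated as unbounded backing storage.
TIERS = [("gpu", 20 * 2**30), ("ram", 8 * 2**30), ("nvme", float("inf"))]

def place(tensors):
    """Greedy placement: hottest tensors claim the fastest tier with room."""
    used = {tier: 0 for tier, _ in TIERS}
    placement = {}
    for t in sorted(tensors, key=lambda t: t.hotness, reverse=True):
        for tier, capacity in TIERS:
            if used[tier] + t.size <= capacity:
                placement[t.name] = tier
                used[tier] += t.size
                break
    return placement
```

A real solver would also weigh per-tier bandwidth and tensor co-access patterns rather than a single hotness score, but the capacity-constrained priority ordering is the core idea.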
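Streaming FFN weights from NVMe implies overlapping disk reads with compute so the GPU is not left waiting on storage. A minimal double-buffering sketch of the pool-buffer idea (the `StreamPool` class, its two-buffer design, and the fixed per-layer size are assumptions for illustration, not Hypura's implementation):

```python
import threading

class StreamPool:
    """Two reusable buffers: read the next layer's weights from disk
    in a background thread while the current layer computes."""

    def __init__(self, path, layer_offsets, layer_size):
        self.f = open(path, "rb")
        self.offsets = layer_offsets        # byte offset of each layer's weights
        self.buffers = [bytearray(layer_size), bytearray(layer_size)]
        self.pending = None                 # in-flight read, if any

    def _read(self, layer):
        self.f.seek(self.offsets[layer])
        self.f.readinto(self.buffers[layer % 2])

    def prefetch(self, layer):
        """Kick off an asynchronous read of the given layer's weights."""
        self.pending = threading.Thread(target=self._read, args=(layer,))
        self.pending.start()

    def get(self, layer):
        """Block until the prefetched read completes, then hand back the buffer."""
        if self.pending is not None:
            self.pending.join()
            self.pending = None
        return self.buffers[layer % 2]
```

In use, the caller would `prefetch(i + 1)` right after `get(i)`, so layer i+1 streams in while layer i runs; alternating buffers keeps the read from clobbering weights still in use.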

Key Innovations

Benchmarks

Significance

Hypura democratizes large model inference by making it possible to run frontier-scale models on consumer hardware. For researchers, developers, and enthusiasts who can't afford $10,000+ GPU servers, this opens up new possibilities for local AI experimentation.
