# Ollama 0.19 MLX: Apple Silicon Local AI Inference Gets a Massive Speed Boost
Ollama, the popular open-source tool for running large language models locally, has released version 0.19 with a major architectural shift: on Apple Silicon, the entire inference backend is now powered by MLX, Apple's open-source machine learning framework. The result is a dramatic speedup across supported devices.
## Key Performance Gains
Benchmarks conducted on March 29, 2026, using Alibaba's Qwen3.5-35B-A3B model show impressive improvements:
| Metric | Ollama 0.19 (MLX) | Ollama 0.18 |
|---|---|---|
| Prefill | 1,810 tok/s | 1,154 tok/s |
| Decode | 112 tok/s | 58 tok/s |
On Apple's M5, M5 Pro, and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time-to-first-token (TTFT) and generation throughput.
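The relative gains implied by the table above are straightforward to compute; the numbers below come directly from the benchmark figures:

```python
# Relative speedups implied by the benchmark table above (values in tok/s).
benchmarks = {
    "prefill": (1810, 1154),  # (Ollama 0.19 MLX, Ollama 0.18)
    "decode": (112, 58),
}

speedups = {phase: new / old for phase, (new, old) in benchmarks.items()}
for phase, s in speedups.items():
    print(f"{phase}: {s:.2f}x faster")  # prefill: 1.57x, decode: 1.93x
```

Prefill is roughly 1.6x faster and decode nearly 2x faster, which matches the headline claim of a dramatic speedup.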
## NVFP4 Quantization Support
Ollama now supports NVIDIA's NVFP4 format, a low-precision inference standard that:
- Maintains model accuracy while reducing memory bandwidth
- Cuts storage requirements significantly
- Enables parity with production inference environments
- Opens the door to running models optimized by NVIDIA's Model Optimizer tool
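The core idea behind block-scaled FP4 formats like NVFP4 can be sketched in a few lines. This is a simplified illustration, not NVIDIA's actual implementation: real NVFP4 uses FP8 (E4M3) scales over 16-element blocks plus a tensor-level scale, while this sketch uses a plain float scale over a single small block.

```python
# Simplified sketch of block-scaled 4-bit float quantization, the idea
# behind formats like NVFP4. The representable magnitudes below are the
# FP4 E2M1 values; real NVFP4 also stores an FP8 scale per 16-element
# block and a tensor-level scale, both omitted here for clarity.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted(FP4_E2M1 + [-v for v in FP4_E2M1 if v > 0])

def quantize_block(block):
    """Scale the block so its largest magnitude maps to 6.0 (the FP4 max),
    then snap each scaled value to the nearest representable FP4 value."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    q = [min(FP4_VALUES, key=lambda v: abs(x / scale - v)) for x in block]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

weights = [0.12, -0.90, 0.33, 0.05, 1.40, -0.27, 0.61, -1.10]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

Each value is stored in 4 bits plus a shared per-block scale, which is where the memory-bandwidth and storage savings come from.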
## Improved Caching for Agentic Workloads
The caching system has been overhauled specifically for coding agents and agentic tasks:
- Cross-conversation cache reuse — Lower memory utilization and more cache hits when branching with tools like Claude Code
- Intelligent checkpoints — Snapshots stored at smart locations in the prompt, reducing prompt processing time
- Smarter eviction — Shared prefixes survive longer even when older branches are dropped
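The mechanics of cross-conversation prefix reuse can be sketched with a toy cache. This is a hypothetical illustration of the general idea, not Ollama's actual implementation: checkpoints of processed-prompt state are keyed by token prefixes, and a new request reuses the longest matching prefix instead of reprocessing it.

```python
# Toy sketch of cross-conversation prefix caching with checkpoints.
# Illustrative only, not Ollama's real cache: snapshots are keyed by
# token prefixes, and requests reuse the longest matching prefix.
class PrefixCache:
    def __init__(self):
        self.checkpoints = {}  # token-prefix tuple -> opaque KV snapshot

    def save(self, tokens):
        """Store a checkpoint at this point in the prompt."""
        self.checkpoints[tuple(tokens)] = f"snapshot@{len(tokens)}"

    def longest_prefix(self, tokens):
        """Return the longest cached prefix of `tokens`, or ()."""
        best = ()
        for prefix in self.checkpoints:
            if len(prefix) > len(best) and tuple(tokens[:len(prefix)]) == prefix:
                best = prefix
        return best

cache = PrefixCache()
shared = ["system", "tools", "repo-context"]
cache.save(shared)                 # checkpoint after the shared prefix
cache.save(shared + ["branch-a"])  # one agent branch

# A second branch reuses the shared prefix instead of reprocessing it.
request = shared + ["branch-b", "user-msg"]
hit = cache.longest_prefix(request)
print(f"reused {len(hit)} of {len(request)} prompt tokens")  # reused 3 of 5
```

This is why branching agents benefit: every branch shares the system prompt and tool definitions, so only the divergent suffix needs fresh prompt processing.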
## Impact on AI Agent Ecosystem
This update directly benefits the local AI agent ecosystem:
- OpenClaw personal assistants respond faster on Mac
- Claude Code, OpenCode, and Codex coding agents see significant speed improvements
- Users can run Qwen3.5-35B with coding-optimized sampling parameters
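As an illustration, sampling parameters can be set per request through Ollama's REST API (`POST /api/generate`). The option names below are Ollama's standard generation parameters, but the specific values are assumptions chosen for the example, not documented recommendations for this model.

```python
import json

# Build a request for Ollama's REST API with coding-oriented sampling
# options. Option names (temperature, top_p, repeat_penalty) are
# standard Ollama parameters; the values here are illustrative.
payload = {
    "model": "qwen3.5:35b-a3b-coding-nvfp4",
    "prompt": "Write a binary search in Python.",
    "stream": False,
    "options": {
        "temperature": 0.2,    # low randomness suits code generation
        "top_p": 0.9,
        "repeat_penalty": 1.1,
    },
}
body = json.dumps(payload).encode()

# To actually send it (requires a running Ollama server on the default port):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate", data=body,
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```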
## Getting Started
Requires a Mac with 32GB+ unified memory:

```shell
# Then run:
ollama run qwen3.5:35b-a3b-coding-nvfp4

# With Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
```
This release represents a significant milestone for local AI inference, demonstrating that Apple Silicon can compete with dedicated GPU setups for running large language models.