# Ollama 0.19 MLX: Apple Silicon Local AI Inference Gets a Massive Speed Boost
Ollama, the popular open-source tool for running large language models locally, has released version 0.19 with a major architectural shift: on Apple Silicon, the entire inference backend is now powered by MLX, Apple's open-source machine learning framework. The result is a dramatic speedup across supported devices.
## Key Performance Gains
Benchmarks conducted on March 29, 2026, using Alibaba's Qwen3.5-35B-A3B model show impressive improvements:
| Metric | Ollama 0.19 (MLX) | Ollama 0.18 |
|---|---|---|
| Prefill | 1,810 tok/s | 1,154 tok/s |
| Decode | 112 tok/s | 58 tok/s |
On Apple's M5, M5 Pro, and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time-to-first-token (TTFT) and generation throughput.
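The relative gains implied by the table above are straightforward to compute; the numbers below come directly from the benchmark figures:

```python
# Relative speedups implied by the benchmark table above (values in tok/s).
benchmarks = {
    "prefill": (1810, 1154),  # (Ollama 0.19 MLX, Ollama 0.18)
    "decode": (112, 58),
}

speedups = {phase: new / old for phase, (new, old) in benchmarks.items()}
for phase, s in speedups.items():
    print(f"{phase}: {s:.2f}x faster")  # prefill: 1.57x, decode: 1.93x
```

Prefill is roughly 1.6x faster and decode nearly 2x faster, which matches the headline claim of a dramatic speedup.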
## NVFP4 Quantization Support
Ollama now supports NVIDIA's NVFP4 format, a low-precision inference standard that:
- Maintains model accuracy while reducing memory bandwidth
- Cuts storage requirements significantly
- Enables parity with production inference environments
- Opens the door to running models optimized by NVIDIA's Model Optimizer tool
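The core idea behind block-scaled FP4 formats like NVFP4 can be sketched in a few lines. This is a simplified illustration, not NVIDIA's actual implementation: real NVFP4 uses FP8 (E4M3) scales over 16-element blocks plus a tensor-level scale, while this sketch uses a plain float scale over a single small block.

```python
# Simplified sketch of block-scaled 4-bit float quantization, the idea
# behind formats like NVFP4. The representable magnitudes below are the
# FP4 E2M1 values; real NVFP4 also stores an FP8 scale per 16-element
# block and a tensor-level scale, both omitted here for clarity.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted(FP4_E2M1 + [-v for v in FP4_E2M1 if v > 0])

def quantize_block(block):
    """Scale the block so its largest magnitude maps to 6.0 (the FP4 max),
    then snap each scaled value to the nearest representable FP4 value."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    q = [min(FP4_VALUES, key=lambda v: abs(x / scale - v)) for x in block]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

weights = [0.12, -0.90, 0.33, 0.05, 1.40, -0.27, 0.61, -1.10]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

Each value is stored in 4 bits plus a shared per-block scale, which is where the memory-bandwidth and storage savings come from.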
## Improved Caching for Agentic Workloads
The caching system has been overhauled specifically for coding agents and agentic tasks:
- Cross-conversation cache reuse — Lower memory utilization and more cache hits when branching with tools like Claude Code
- Intelligent checkpoints — Snapshots stored at smart locations in the prompt, reducing prompt processing time
- Smarter eviction — Shared prefixes survive longer even when older branches are dropped
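The mechanics of cross-conversation prefix reuse can be sketched with a toy cache. This is a hypothetical illustration of the general idea, not Ollama's actual implementation: checkpoints of processed-prompt state are keyed by token prefixes, and a new request reuses the longest matching prefix instead of reprocessing it.

```python
# Toy sketch of cross-conversation prefix caching with checkpoints.
# Illustrative only, not Ollama's real cache: snapshots are keyed by
# token prefixes, and requests reuse the longest matching prefix.
class PrefixCache:
    def __init__(self):
        self.checkpoints = {}  # token-prefix tuple -> opaque KV snapshot

    def save(self, tokens):
        """Store a checkpoint at this point in the prompt."""
        self.checkpoints[tuple(tokens)] = f"snapshot@{len(tokens)}"

    def longest_prefix(self, tokens):
        """Return the longest cached prefix of `tokens`, or ()."""
        best = ()
        for prefix in self.checkpoints:
            if len(prefix) > len(best) and tuple(tokens[:len(prefix)]) == prefix:
                best = prefix
        return best

cache = PrefixCache()
shared = ["system", "tools", "repo-context"]
cache.save(shared)                 # checkpoint after the shared prefix
cache.save(shared + ["branch-a"])  # one agent branch

# A second branch reuses the shared prefix instead of reprocessing it.
request = shared + ["branch-b", "user-msg"]
hit = cache.longest_prefix(request)
print(f"reused {len(hit)} of {len(request)} prompt tokens")  # reused 3 of 5
```

This is why branching agents benefit: every branch shares the system prompt and tool definitions, so only the divergent suffix needs fresh prompt processing.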
## Impact on AI Agent Ecosystem
This update directly benefits the local AI agent ecosystem:
- OpenClaw personal assistants respond faster on Mac
- Claude Code, OpenCode, and Codex coding agents see significant speed improvements
- Users can run Qwen3.5-35B with coding-optimized sampling parameters
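As an illustration, sampling parameters can be set per request through Ollama's REST API (`POST /api/generate`). The option names below are Ollama's standard generation parameters, but the specific values are assumptions chosen for the example, not documented recommendations for this model.

```python
import json

# Build a request for Ollama's REST API with coding-oriented sampling
# options. Option names (temperature, top_p, repeat_penalty) are
# standard Ollama parameters; the values here are illustrative.
payload = {
    "model": "qwen3.5:35b-a3b-coding-nvfp4",
    "prompt": "Write a binary search in Python.",
    "stream": False,
    "options": {
        "temperature": 0.2,    # low randomness suits code generation
        "top_p": 0.9,
        "repeat_penalty": 1.1,
    },
}
body = json.dumps(payload).encode()

# To actually send it (requires a running Ollama server on the default port):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate", data=body,
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```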
## Getting Started
Requires a Mac with 32GB+ unified memory:

```shell
# Then run:
ollama run qwen3.5:35b-a3b-coding-nvfp4

# With Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
```
This release represents a significant milestone for local AI inference, demonstrating that Apple Silicon can compete with dedicated GPU setups for running large language models.