Ollama 0.19 MLX: Apple Silicon Local AI Inference Gets a Massive Speed Boost

2026-03-31 · 2 min read

Ollama, the popular open-source tool for running large language models locally, has released version 0.19 with a groundbreaking shift: the entire inference backend on Apple Silicon is now powered by MLX, Apple's machine learning framework. The result is a dramatic speedup across all Apple Silicon devices.

Key Performance Gains

Benchmarks conducted on March 29, 2026, using Alibaba's Qwen3.5-35B-A3B model show impressive improvements:

| Metric  | Ollama 0.19 (MLX) | Ollama 0.18 |
|---------|-------------------|-------------|
| Prefill | 1,810 tok/s       | 1,154 tok/s |
| Decode  | 112 tok/s         | 58 tok/s    |

On Apple's M5, M5 Pro, and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time-to-first-token (TTFT) and generation throughput.
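As a quick sanity check, the table's figures work out to roughly a 1.6x prefill and 1.9x decode speedup:

```python
# Speedup factors implied by the published Qwen3.5-35B-A3B benchmark numbers.
prefill_mlx, prefill_old = 1810, 1154   # tok/s, Ollama 0.19 vs 0.18
decode_mlx, decode_old = 112, 58        # tok/s

prefill_speedup = prefill_mlx / prefill_old
decode_speedup = decode_mlx / decode_old

print(f"prefill: {prefill_speedup:.2f}x, decode: {decode_speedup:.2f}x")
# → prefill: 1.57x, decode: 1.93x
```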

NVFP4 Quantization Support

Ollama now supports NVIDIA's NVFP4 format, a 4-bit floating-point standard for low-precision inference that sharply reduces model memory and bandwidth requirements.
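A back-of-the-envelope sketch of what 4-bit weights mean for a 35B-parameter model. The block layout assumed below (4-bit values with one 8-bit scale per 16-element block) comes from NVIDIA's published NVFP4 format description, not from anything specific to Ollama's implementation:

```python
# Rough memory footprint of 35B parameters in FP16 vs NVFP4.
# Assumption: NVFP4 stores 4-bit values plus one 8-bit scale per 16-element block.
params = 35e9

fp16_gb = params * 16 / 8 / 1e9                 # 2 bytes per weight
nvfp4_bits_per_weight = 4 + 8 / 16              # value bits + amortized scale bits
nvfp4_gb = params * nvfp4_bits_per_weight / 8 / 1e9

print(f"FP16: {fp16_gb:.1f} GB, NVFP4: {nvfp4_gb:.1f} GB")
# → FP16: 70.0 GB, NVFP4: 19.7 GB
```

That roughly 3.5x shrink is why a 35B model can fit comfortably within 32 GB of unified memory.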

Improved Caching for Agentic Workloads

The caching system has been overhauled specifically for coding agents and agentic tasks, where long, mostly unchanged prompt prefixes are resent on every turn and benefit heavily from reuse.
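Ollama's actual cache internals aren't detailed here, but the core idea behind prefix caching for agents can be sketched in a few lines: key cached state by token prefix, so a resent conversation only pays prefill cost for the new suffix. This is purely illustrative; `kv_state` stands in for real KV-cache tensors:

```python
# Illustrative prefix cache: reuse state for the longest cached prefix
# of an incoming token sequence, recompute only the remaining suffix.
class PrefixCache:
    def __init__(self):
        self._cache = {}  # token-tuple prefix -> opaque kv_state

    def put(self, tokens, kv_state):
        self._cache[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_state) for the longest cached prefix."""
        best_len, best_state = 0, None
        for prefix, state in self._cache.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state

cache = PrefixCache()
cache.put([1, 2, 3, 4], "kv@4")          # state saved after a first agent turn
hit, state = cache.longest_prefix([1, 2, 3, 4, 5, 6])
print(hit, state)  # → 4 kv@4  (only tokens 5 and 6 need fresh prefill)
```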

Impact on AI Agent Ecosystem

This update directly benefits the local AI agent ecosystem: faster prefill means lower time-to-first-token for agents that repeatedly send long prompts, and NVFP4 quantization frees memory for larger models and contexts.

Getting Started

Requires a Mac with 32GB+ unified memory:

# After installing or upgrading to Ollama 0.19, run:
ollama run qwen3.5:35b-a3b-coding-nvfp4

# With Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
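Agents that prefer HTTP over the CLI can talk to the local server through Ollama's standard REST API. A minimal sketch, assuming a server started with `ollama serve` on the default port; the model tag is the one from the commands above:

```python
import json
from urllib import request

# Payload for Ollama's /api/generate endpoint (stream disabled so the
# server returns a single JSON object instead of a token stream).
payload = {
    "model": "qwen3.5:35b-a3b-coding-nvfp4",
    "prompt": "Write a binary search in Python.",
    "stream": False,
}

def generate(payload, host="http://localhost:11434"):
    """POST to a locally running Ollama server and return the generated text."""
    req = request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["response"]

# With `ollama serve` running: print(generate(payload))
```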

This release represents a significant milestone for local AI inference, demonstrating that Apple Silicon can compete with dedicated GPU setups for running large language models.
