Qwen3.5-9B on MacBook M5 Scores 93.8% vs GPT-5.4's 97.9% on Security Benchmark

2026-03-20T20:34:30.000Z·2 min read

A new HomeSec-Bench benchmark shows Qwen3.5-9B running locally on a MacBook Pro M5 achieves 93.8% accuracy on security tasks, just 4 points behind GPT-5.4 cloud — with zero API costs and full privacy.

Local AI Closes the Gap with Cloud Models

A new benchmark called HomeSec-Bench demonstrates that small open-source models running entirely on consumer hardware are now competitive with the best cloud APIs on specialized tasks.

Qwen3.5-9B, running locally on a MacBook Pro M5 via llama.cpp, scored 93.8% on a 96-test security evaluation — within 4 points of GPT-5.4's 97.9%. The local model used only 13.8 GB of unified memory and ran at 25 tokens per second with a 765ms time-to-first-token.

The Leaderboard

Rank	Model	Type	Pass Rate	Time
1	GPT-5.4	Cloud	97.9%	2m 22s
2	GPT-5.4-mini	Cloud	95.8%	1m 17s
3	Qwen3.5-9B	Local	93.8%	5m 23s
4	Qwen3.5-27B	Local	93.8%	15m 08s
5	Qwen3.5-122B-MoE	Local	92.7%	8m 26s
6	GPT-5.4-nano	Cloud	92.7%	1m 34s
7	Qwen3.5-35B-MoE	Local	91.7%	3m 30s

Notably, the Qwen3.5-35B-MoE achieved a lower time-to-first-token than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.

What is HomeSec-Bench?

HomeSec-Bench evaluates LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs. The benchmark covers:

Tool use and function calling
Security classification and threat assessment
Event deduplication
Image analysis (35 AI-generated fixture images)

Tests run against any OpenAI-compatible endpoint, making it straightforward to benchmark local vs cloud models.

Why This Matters

Privacy-first use cases are viable — A 9B model on a laptop achieving 93.8% on domain tasks means sensitive applications like home security can run fully offline with complete privacy.

Cost elimination — Zero API costs for near-GPT-5 performance on specialized tasks. For high-volume applications, this is significant.

The gap is narrowing fast — Qwen3.5-9B is a freely available model. When a 9B parameter model matches within 4% of the best cloud model, the traditional cloud API value proposition weakens.

Domain specialization matters — The benchmark targets a specific use case (home security). For narrow domains, smaller models can match general-purpose frontier models.

The Local AI Value Proposition

A 9B model on a laptop scoring within 4% of GPT-5.4 on domain tasks — fully offline with complete privacy — is the value proposition of local AI.

As open-source models continue to improve and Apple Silicon gains more unified memory, the question is not whether local models will catch up on domain tasks — it's how quickly they'll surpass cloud models on cost-adjusted performance.

The system behind the benchmark is Aegis-AI, a local-first AI home security platform running on consumer hardware.

↗ Original source

ai local ai qwen apple macbook benchmark privacy llm

Comments0