Qwen3.5-9B on MacBook M5 Scores 93.8% vs GPT-5.4's 97.9% on Security Benchmark

2026-03-20T20:34:30.000Z·2 min read
A new HomeSec-Bench benchmark shows Qwen3.5-9B running locally on a MacBook Pro M5 achieves 93.8% accuracy on security tasks, just 4 points behind GPT-5.4 cloud — with zero API costs and full privacy.

Local AI Closes the Gap with Cloud Models

A new benchmark called HomeSec-Bench demonstrates that small open-source models running entirely on consumer hardware are now competitive with the best cloud APIs on specialized tasks.

Qwen3.5-9B, running locally on a MacBook Pro M5 via llama.cpp, scored 93.8% on a 96-test security evaluation — within 4 points of GPT-5.4's 97.9%. The local model used only 13.8 GB of unified memory and ran at 25 tokens per second with a 765ms time-to-first-token.

The Leaderboard

RankModelTypePass RateTime
1GPT-5.4Cloud97.9%2m 22s
2GPT-5.4-miniCloud95.8%1m 17s
3Qwen3.5-9BLocal93.8%5m 23s
4Qwen3.5-27BLocal93.8%15m 08s
5Qwen3.5-122B-MoELocal92.7%8m 26s
6GPT-5.4-nanoCloud92.7%1m 34s
7Qwen3.5-35B-MoELocal91.7%3m 30s

Notably, the Qwen3.5-35B-MoE achieved a lower time-to-first-token than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.

What is HomeSec-Bench?

HomeSec-Bench evaluates LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs. The benchmark covers:

Tests run against any OpenAI-compatible endpoint, making it straightforward to benchmark local vs cloud models.

Why This Matters

  1. Privacy-first use cases are viable — A 9B model on a laptop achieving 93.8% on domain tasks means sensitive applications like home security can run fully offline with complete privacy.
  1. Cost elimination — Zero API costs for near-GPT-5 performance on specialized tasks. For high-volume applications, this is significant.
  1. The gap is narrowing fast — Qwen3.5-9B is a freely available model. When a 9B parameter model matches within 4% of the best cloud model, the traditional cloud API value proposition weakens.
  1. Domain specialization matters — The benchmark targets a specific use case (home security). For narrow domains, smaller models can match general-purpose frontier models.

The Local AI Value Proposition

A 9B model on a laptop scoring within 4% of GPT-5.4 on domain tasks — fully offline with complete privacy — is the value proposition of local AI.

As open-source models continue to improve and Apple Silicon gains more unified memory, the question is not whether local models will catch up on domain tasks — it's how quickly they'll surpass cloud models on cost-adjusted performance.

The system behind the benchmark is Aegis-AI, a local-first AI home security platform running on consumer hardware.

↗ Original source
← Previous: China's Yeshu Group Seeks 50 Humanoid Robots for Coconut ProcessingNext: Super Micro Shares Plunge 25% After Co-Founder Charged in $2.5B AI Chip Smuggling Plot →
Comments0