Qwen3.5-9B on MacBook M5 Scores 93.8% vs GPT-5.4's 97.9% on Security Benchmark
Local AI Closes the Gap with Cloud Models
A new benchmark called HomeSec-Bench demonstrates that small open-source models running entirely on consumer hardware are now competitive with the best cloud APIs on specialized tasks.
Qwen3.5-9B, running locally on a MacBook Pro M5 via llama.cpp, scored 93.8% on a 96-test security evaluation — within 4 points of GPT-5.4's 97.9%. The local model used only 13.8 GB of unified memory and ran at 25 tokens per second with a 765ms time-to-first-token.
The Leaderboard
| Rank | Model | Type | Pass Rate | Time |
|---|---|---|---|---|
| 1 | GPT-5.4 | Cloud | 97.9% | 2m 22s |
| 2 | GPT-5.4-mini | Cloud | 95.8% | 1m 17s |
| 3 | Qwen3.5-9B | Local | 93.8% | 5m 23s |
| 4 | Qwen3.5-27B | Local | 93.8% | 15m 08s |
| 5 | Qwen3.5-122B-MoE | Local | 92.7% | 8m 26s |
| 6 | GPT-5.4-nano | Cloud | 92.7% | 1m 34s |
| 7 | Qwen3.5-35B-MoE | Local | 91.7% | 3m 30s |
Notably, the Qwen3.5-35B-MoE achieved a lower time-to-first-token than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.
What is HomeSec-Bench?
HomeSec-Bench evaluates LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs. The benchmark covers:
- Tool use and function calling
- Security classification and threat assessment
- Event deduplication
- Image analysis (35 AI-generated fixture images)
Tests run against any OpenAI-compatible endpoint, making it straightforward to benchmark local vs cloud models.
Why This Matters
- Privacy-first use cases are viable — A 9B model on a laptop achieving 93.8% on domain tasks means sensitive applications like home security can run fully offline with complete privacy.
- Cost elimination — Zero API costs for near-GPT-5 performance on specialized tasks. For high-volume applications, this is significant.
- The gap is narrowing fast — Qwen3.5-9B is a freely available model. When a 9B parameter model matches within 4% of the best cloud model, the traditional cloud API value proposition weakens.
- Domain specialization matters — The benchmark targets a specific use case (home security). For narrow domains, smaller models can match general-purpose frontier models.
The Local AI Value Proposition
A 9B model on a laptop scoring within 4% of GPT-5.4 on domain tasks — fully offline with complete privacy — is the value proposition of local AI.
As open-source models continue to improve and Apple Silicon gains more unified memory, the question is not whether local models will catch up on domain tasks — it's how quickly they'll surpass cloud models on cost-adjusted performance.
The system behind the benchmark is Aegis-AI, a local-first AI home security platform running on consumer hardware.