OPEN_SOURCE
REDDIT · 19d ago · BENCHMARK RESULT
Jake Benchmark v1 crowns Qwen3.5 27B
Jake Benchmark v1 ran 7 local models through 22 real agent tasks on OpenClaw and Ollama, using a Raspberry Pi 5 and an RTX 3090. Qwen3.5:27b-q4_K_M won decisively at 59.4%, the 35B runner-up stalled at 23.2%, and the rest mostly flopped.
// ANALYSIS
The real takeaway is that local agent quality still depends more on tool discovery and execution discipline than on raw parameter count. A quantized 27B beating a 35B by 2.5x is a loud reminder that "bigger" is often the wrong instinct for agent workloads.
- Medium thinking was the sweet spot for the winner; higher reasoning hurt, which suggests agent loops need crisp tool selection, not endless deliberation.
- Models that found the `gog` CLI did the work; models that couldn't discover it mostly scored under 5%, making tool surfacing the biggest bottleneck.
- Browser automation was a complete bust across the board, so end-to-end UI agents still need more than a good language model.
- Security behavior was all over the place, from clean phishing refusal to reckless secret probing, so guardrails can't rely on model instinct alone.
- OpenClaw advertises Gmail, GitHub, and browser integrations ([openclaw.ai](https://openclaw.ai/)), but this benchmark suggests the model remains the limiting layer.
// TAGS
jake-benchmark · openclaw · benchmark · llm · agent · automation · cli · computer-use
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8/10
AUTHOR
Emergency_Ant_843