Jake Benchmark v1 crowns Qwen3.5 27B
OPEN_SOURCE ↗
REDDIT · 19d ago · BENCHMARK RESULT


Jake Benchmark v1 ran 7 local models through 22 real agent tasks on OpenClaw and Ollama, using a Raspberry Pi 5 and an RTX 3090. Qwen3.5:27b-q4_K_M won decisively at 59.4%, while the 35B runner-up stalled at 23.2% and the rest mostly flopped.

// ANALYSIS

The real takeaway is that local agent quality still depends more on tool discovery and execution discipline than on raw parameter count. A quantized 27B beating a 35B by 2.5x is a loud reminder that "bigger" is often the wrong instinct for agent workloads.

  • Medium thinking was the sweet spot for the winner; higher reasoning hurt, which suggests agent loops need crisp tool selection, not endless deliberation.
  • Models that found the `gog` CLI did the work; models that couldn't discover it mostly died under 5%, making tool surfacing the biggest bottleneck.
  • Browser automation was a complete bust across the board, so end-to-end UI agents still need more than a good language model.
  • Security behavior was all over the place, from clean phishing refusal to reckless secret probing, so guardrails can't rely on model instinct alone.
  • OpenClaw advertises Gmail, GitHub, and browser integrations ([openclaw.ai](https://openclaw.ai/)), but this benchmark suggests the model remains the limiting layer.
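The tool-discovery finding above has a practical corollary: if models that couldn't find the `gog` CLI died under 5%, the cheapest fix is to surface installed tools explicitly rather than hoping the model discovers them. A minimal sketch of that idea follows; all names and the prompt shape are illustrative assumptions, not the OpenClaw or `gog` APIs.

```python
# Hypothetical sketch: list available tools in the prompt up front so the
# model never has to discover them on its own. Tool names and descriptions
# here are illustrative, not taken from OpenClaw or the benchmark harness.

def build_tool_prompt(tools: dict[str, str], task: str) -> str:
    """Enumerate each tool and a one-line description before the task,
    turning tool discovery into tool selection."""
    listing = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return f"Available tools:\n{listing}\n\nTask: {task}"

# Example usage with made-up entries:
tools = {
    "gog": "command-line helper used by the benchmark's agent tasks",
    "browser": "UI automation driver (unreliable per this benchmark)",
}
prompt = build_tool_prompt(tools, "archive last week's reports")
```

The design choice mirrors the benchmark result: models rewarded crisp tool selection over open-ended exploration, so moving the tool inventory into the prompt removes the step most models failed at.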
// TAGS
jake-benchmark · openclaw · benchmark · llm · agent · automation · cli · computer-use

DISCOVERED

2026-03-23 (19d ago)

PUBLISHED

2026-03-23 (19d ago)

RELEVANCE

8 / 10

AUTHOR

Emergency_Ant_843