OPEN_SOURCE
REDDIT · 19d ago · BENCHMARK RESULT
Jake Benchmark v1 crowns Qwen3.5 27B
Jake Benchmark v1 ran 7 local models through 22 real agent tasks on OpenClaw and Ollama, using a Raspberry Pi 5 and an RTX 3090. Qwen3.5:27b-q4_K_M won decisively at 59.4%, the 35B runner-up stalled at 23.2%, and the rest mostly flopped.
// ANALYSIS
The real takeaway is that local agent quality still depends more on tool discovery and execution discipline than on raw parameter count. A quantized 27B beating a 35B by 2.5x is a loud reminder that "bigger" is often the wrong instinct for agent workloads.
- Medium thinking was the sweet spot for the winner; higher reasoning hurt, which suggests agent loops need crisp tool selection, not endless deliberation.
- Models that found the `gog` CLI did the work; models that couldn't discover it mostly scored under 5%, making tool surfacing the biggest bottleneck.
- Browser automation was a complete bust across the board, so end-to-end UI agents still need more than a good language model.
- Security behavior was all over the place, from clean phishing refusal to reckless secret probing, so guardrails can't rely on model instinct alone.
- OpenClaw advertises Gmail, GitHub, and browser integrations ([openclaw.ai](https://openclaw.ai/)), but this benchmark suggests the model remains the limiting layer.
// TAGS
jake-benchmark · openclaw · benchmark · llm · agent · automation · cli · computer-use
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8/10
AUTHOR
Emergency_Ant_843