OPEN_SOURCE
REDDIT // 20d ago · TUTORIAL
Qwen3.5 local coding benchmark disappoints
A LocalLLaMA user ran Qwen3.5-27B and Qwen3.5-35B-A3B through Claude Code on oMLX on an M4 Max 40-core/64GB Mac, but a simple Bomberman prompt still produced unusable code. The thread turns into a practical discussion of how to benchmark coding LLMs, which settings matter, and whether the dense 27B or the sparse 35B-A3B is the better local pick.
// ANALYSIS
This is more an orchestration problem than a model verdict. For local coding, the agent loop, prompt shape, and sampling defaults matter almost as much as the model family.
- Qwen3.5's docs recommend conservative coding settings like `temperature=0.6`, `top_p=0.95`, `top_k=20`, and `presence_penalty=0.0`; a generic chat preset can make the same model look far worse than it is.
- The official Qwen3.5 cards put the dense 27B slightly ahead of the 35B-A3B on SWE-bench Verified, so a bigger total parameter count is not automatically better for coding.
- Long context helps the model remember repo state; it does not magically improve reasoning, and once the thread balloons you pay more latency and lose focus.
- Claude Code can route to local backends, but small models need shorter, test-driven tasks and tighter prompts to stay on rails.
- If you want a fair benchmark, use edit-run-fix loops with pass/fail tests instead of a one-shot “build me a game” prompt.
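The bullets above can be sketched as a minimal pass/fail harness. This is an illustrative sketch, not the thread's actual setup: the `SAMPLING` values are the recommended settings quoted above, while `build_request`, `run_candidate`, and the model name are hypothetical names assuming an OpenAI-compatible local endpoint (e.g. an MLX or llama.cpp server).

```python
import pathlib
import subprocess
import sys
import tempfile

# Sampling settings quoted in the analysis above (Qwen3.5's recommended
# coding defaults); everything else in this sketch is illustrative.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0}


def build_request(prompt: str, model: str = "qwen3.5-27b") -> dict:
    """Payload for an OpenAI-compatible /v1/chat/completions endpoint.

    The model name is a placeholder; point it at whatever your local
    server exposes.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }


def run_candidate(generated_code: str, test_code: str) -> bool:
    """Write the model's code and a plain-assert test side by side,
    run the test in a subprocess, and report pass/fail."""
    with tempfile.TemporaryDirectory() as workdir:
        root = pathlib.Path(workdir)
        (root / "solution.py").write_text(generated_code)
        (root / "check.py").write_text(test_code)
        result = subprocess.run(
            [sys.executable, "check.py"],
            cwd=root,
            capture_output=True,
            timeout=60,
        )
        return result.returncode == 0
```

Feeding a failing check's stderr back into the next request turns this into the edit-run-fix loop the last bullet describes, and the pass rate over a fixed task list is a far more stable signal than eyeballing one-shot game output.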
// TAGS
qwen3-5 · llm · ai-coding · agent · benchmark · inference · self-hosted · open-weights
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8/10
AUTHOR
shirogeek