Qwen3.5 local coding benchmark disappoints
OPEN_SOURCE
REDDIT · 20d ago · TUTORIAL


A LocalLLaMA user ran Qwen3.5-27B and Qwen3.5-35B-A3B through Claude Code on oMLX on an M4 Max 40-core/64GB Mac, but a simple Bomberman prompt still produced unusable code. The thread turns into a practical discussion of how to benchmark coding LLMs, which settings matter, and whether the dense 27B or the sparse 35B-A3B is the better local pick.

// ANALYSIS

This is more an orchestration problem than a model verdict. For local coding, the agent loop, prompt shape, and sampling defaults matter almost as much as the model family.

  • Qwen3.5's docs recommend conservative coding settings like `temperature=0.6`, `top_p=0.95`, `top_k=20`, and `presence_penalty=0.0`; a generic chat preset can make the same model look far worse than it is.
  • The official Qwen3.5 cards put the dense 27B slightly ahead of the 35B-A3B on SWE-bench Verified, so bigger total parameter counts are not automatically better for coding.
  • Context length helps the model remember repo state, not magically improve reasoning; once the thread balloons, you pay more latency and lose focus.
  • Claude Code can route to local backends, but small models need shorter, test-driven tasks and tighter prompts to stay on rails.
  • If you want a fair benchmark, use edit-run-fix loops with pass/fail tests instead of a one-shot “build me a game” prompt.
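The sampling defaults in the first bullet can be pinned down in code rather than left to a chat preset. A minimal sketch for any OpenAI-compatible local server (the model name and helper are illustrative, not from the thread):

```python
# Qwen3.5's documented coding defaults, per the model card cited above.
QWEN_CODING_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "presence_penalty": 0.0,
}

def build_request(model: str, prompt: str, max_tokens: int = 2048) -> dict:
    """Build a chat-completions payload with the coding defaults applied,
    so every benchmark run uses the same sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **QWEN_CODING_SAMPLING,
    }
```

Keeping the settings in one dict makes it obvious when a run silently fell back to a generic preset.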
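The edit-run-fix benchmark in the last bullet can be sketched as a pass/fail harness: execute each model-generated solution together with its tests in a subprocess and count what passes. The task format and scoring here are assumptions, not something the thread specifies:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, test_code: str, timeout: int = 10) -> bool:
    """Run model-generated code plus its unit tests in a fresh interpreter;
    return True only if every assertion passes (exit code 0)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    finally:
        os.unlink(path)

def score(candidates: list[tuple[str, str]]) -> float:
    """Fraction of (code, tests) pairs that pass: one pass@1-style number
    per model, instead of eyeballing a one-shot game prompt."""
    passed = sum(run_candidate(code, tests) for code, tests in candidates)
    return passed / len(candidates)
```

Running the same fixed task set through both the 27B and the 35B-A3B with identical sampling settings gives a comparable number, which a "build me a game" prompt never will.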
// TAGS
qwen3-5 · llm · ai-coding · agent · benchmark · inference · self-hosted · open-weights

DISCOVERED

20d ago

2026-03-23

PUBLISHED

20d ago

2026-03-23

RELEVANCE

8/10

AUTHOR

shirogeek