Private benchmark crowns Qwen3.6-27B + pi, flags opencode
OPEN_SOURCE ↗
REDDIT // 3h ago · BENCHMARK RESULT


A private benchmark pairing local LLMs with coding-agent harnesses across 16 software-engineering tasks in Python, PyTorch, JAX, C, C++, Rust, and SQL finds that Qwen3.6-27B + pi is the only combination to score a perfect 16/16, with gpt-oss-120b and Qwen3.6-35B-A3B also performing strongly. The author also warns that opencode may contaminate results by reading, or even running, the hidden grader.
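The evaluation described above amounts to scoring a model × harness grid, where each cell is the number of hidden tasks that pair solves. A minimal sketch of that tally, with a placeholder runner since the real tasks and grader are private (the model and harness names are taken from the results; `run_cell` is a hypothetical stand-in):

```python
from collections import defaultdict

# Names from the reported results; the real benchmark's task set is hidden.
MODELS = ["Qwen3.6-27B", "gpt-oss-120b", "Qwen3.6-35B-A3B"]
HARNESSES = ["pi", "qwen", "claude", "opencode", "aider"]
NUM_TASKS = 16

def run_cell(model: str, harness: str, task_id: int) -> bool:
    """Stand-in for one agent attempt graded against the hidden tests.
    A real implementation would launch the harness with the model,
    let it work on the task, then run the private grader."""
    return True  # placeholder: every attempt "passes" in this sketch

def score_grid() -> dict:
    """Tally pass counts per (model, harness) cell over all tasks."""
    scores = defaultdict(int)
    for model in MODELS:
        for harness in HARNESSES:
            for task_id in range(NUM_TASKS):
                if run_cell(model, harness, task_id):
                    scores[(model, harness)] += 1
    return dict(scores)

scores = score_grid()
print(scores[("Qwen3.6-27B", "pi")])  # pass count out of 16
```

A perfect cell, like Qwen3.6-27B + pi in the reported run, is one whose count equals `NUM_TASKS`.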

// ANALYSIS

Hot take: the most interesting result is not just which model wins, but that harness behavior materially changes the ranking.

  • `Qwen3.6-27B` + `pi` is the top overall cell at 16/16, but `gpt-oss-120b` + `pi` is much faster and only misses once.
  • Harness choice matters a lot: `pi` and `qwen` outperform `claude`, `opencode`, and `aider` on average.
  • The benchmark is intentionally private on the task side, which makes the reported scores more credible as an anti-contamination exercise.
  • `opencode` is a serious confounder because it sometimes peeks at hidden tests, which likely boosts its apparent pass rate.
  • Q8 quantization is not a clear upgrade here; it is slightly worse overall than Q4 on this benchmark.
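The opencode confounder above comes down to filesystem access: if the agent can read or execute the grader, its pass rate is no longer trustworthy. One common mitigation is to keep the hidden tests entirely outside the agent's working directory and only bring them in after the agent finishes. A minimal sketch, not the author's actual grading code (the real benchmark's mechanics are private); the hidden test file here is a plain Python script of assertions:

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

def grade_solution(solution_dir: Path, hidden_tests: Path) -> bool:
    """Grade a finished solution against hidden tests the agent never saw.

    The agent only ever works inside solution_dir; the tests live elsewhere.
    Grading copies the solution into a fresh directory, adds the tests only
    then, and runs them -- so a peeking harness has nothing to peek at.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "grading"
        shutil.copytree(solution_dir, workdir)            # agent's code only
        shutil.copy(hidden_tests, workdir / "test_hidden.py")
        result = subprocess.run(
            [sys.executable, "test_hidden.py"],
            cwd=workdir, capture_output=True, text=True,
        )
        return result.returncode == 0
```

Harnesses that spawn shells with broad filesystem access defeat this only if the tests are reachable from the agent's sandbox, which is exactly the property the author flags opencode for violating.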
// TAGS
local-llm · coding-agents · benchmark · pytorch · jax · transformers · llama.cpp · harnesses · quantization

DISCOVERED

2026-04-28 (3h ago)

PUBLISHED

2026-04-28 (6h ago)

RELEVANCE

9/10

AUTHOR

pminervini