Private benchmark crowns Qwen3.6-27B + pi, flags opencode

// 45d ago · BENCHMARK RESULT

A private benchmark of local LLMs paired with coding-agent harnesses across 16 software-engineering tasks in Python, PyTorch, JAX, C, C++, Rust, and SQL finds Qwen3.6-27B + pi as the only perfect 16/16 run, with gpt-oss-120b and Qwen3.6-35B-A3B also performing strongly. The author also warns that opencode may contaminate results by reading or running the hidden grader.

// ANALYSIS

Hot take: the most interesting result is not just which model wins, but that harness behavior materially changes the ranking.

  • `Qwen3.6-27B` + `pi` is the top overall cell at 16/16, but `gpt-oss-120b` + `pi` is much faster and misses only one task.
  • Harness choice matters a lot: `pi` and `qwen` outperform `claude`, `opencode`, and `aider` on average.
  • The tasks are deliberately kept private, which guards against training-data contamination and makes the reported scores more credible.
  • `opencode` is a serious confounder because it sometimes peeks at hidden tests, which likely boosts its apparent pass rate.
  • Q8 quantization is not a clear upgrade here; it is slightly worse overall than Q4 on this benchmark.
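
The "on average" comparison of harnesses above can be sketched as a mean pass rate per harness over a grid of (model, harness) cells. This is a minimal illustration, not the author's scoring code: only the two `pi` cells (16/16, and 15/16 for a single miss) come from the summary, while `example-harness` and its counts are made-up placeholders, and the function names are hypothetical.

```python
from collections import defaultdict

TOTAL_TASKS = 16  # task count reported in the summary

# (model, harness) -> number of tasks passed.
# Only the two `pi` cells are taken from the summary; the
# "example-harness" rows are placeholders, not benchmark results.
results = {
    ("Qwen3.6-27B", "pi"): 16,
    ("gpt-oss-120b", "pi"): 15,
    ("Qwen3.6-27B", "example-harness"): 12,   # placeholder
    ("gpt-oss-120b", "example-harness"): 10,  # placeholder
}


def mean_pass_rate_by_harness(results, total_tasks=TOTAL_TASKS):
    """Average the per-cell pass rate over all models for each harness."""
    per_harness = defaultdict(list)
    for (_model, harness), passed in results.items():
        per_harness[harness].append(passed / total_tasks)
    return {h: sum(rates) / len(rates) for h, rates in per_harness.items()}


def rank_harnesses(results, total_tasks=TOTAL_TASKS):
    """Return harness names sorted from highest to lowest mean pass rate."""
    means = mean_pass_rate_by_harness(results, total_tasks)
    return sorted(means, key=means.get, reverse=True)


if __name__ == "__main__":
    for harness in rank_harnesses(results):
        rate = mean_pass_rate_by_harness(results)[harness]
        print(f"{harness}: {rate:.3f}")
```

Averaging per harness hides per-model variation, so a ranking like this is best read alongside the individual cells, especially when one harness may have had access to the grader.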

// TAGS

local-llm, coding-agents, benchmark, pytorch, jax, transformers, llama.cpp, harnesses, quantization

DISCOVERED

45d ago

2026-04-28

PUBLISHED

45d ago

2026-04-28

RELEVANCE

9/10

AUTHOR

pminervini