Qwen3.5/3.6 Claims Hinge on SWE-bench Setup
This Reddit post is a reproducibility check on the Qwen3.5 and Qwen3.6 benchmark numbers, with the author specifically asking what testing environment was used for SWE-bench Verified. The official Qwen model card makes clear these results are not "just run the model and score it" numbers: the SWE-bench series is evaluated through an internal agent scaffold with bash and file-edit tools, a 200K context window, temperature 1.0, and top_p 0.95, while the SWE-bench project itself uses a Docker-based harness for reproducible evaluation. In practice, the benchmark score reflects an agent loop plus tooling, not raw chat inference.
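To make the distinction concrete, an agent-style evaluation wraps the model in a tool-use loop rather than a single prompt. The sketch below is a minimal toy version of such a loop; the tool names (`run_bash`, `apply_edit`) and the action format are illustrative assumptions, since Qwen's internal scaffold is not public:

```python
# Toy sketch of an agent loop with bash and file-edit tools.
# All names and the action format are illustrative assumptions;
# Qwen's internal SWE-bench scaffold is not publicly documented.
import subprocess

def run_bash(cmd: str) -> str:
    """Tool: execute a shell command, return combined stdout/stderr."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def apply_edit(path: str, new_text: str) -> str:
    """Tool: overwrite a file with new contents."""
    with open(path, "w") as f:
        f.write(new_text)
    return f"wrote {len(new_text)} bytes to {path}"

def agent_loop(model_step, max_turns: int = 50) -> list[str]:
    """Feed each tool result back to the model until it signals done.

    `model_step` maps the latest observation to an action dict like
    {"tool": "bash", "arg": "pytest -x"} — a stand-in for model inference.
    """
    transcript = []
    observation = "BEGIN"
    for _ in range(max_turns):
        action = model_step(observation)
        if action["tool"] == "done":
            break
        if action["tool"] == "bash":
            observation = run_bash(action["arg"])
        elif action["tool"] == "edit":
            observation = apply_edit(*action["arg"])
        transcript.append(observation)
    return transcript
```

Scoring a run like this measures the model plus its tools, context budget, and turn limit, which is why a raw chat call cannot reproduce the number.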
Hot take: most “can you reproduce this?” confusion on agent benchmarks comes from comparing raw model runs to a full tool-using evaluation stack.
- Qwen's own benchmark notes say the SWE-bench series uses an internal agent scaffold with bash and file-edit tools, not a plain prompt-only setup.
- The model card also fixes evaluation knobs like temperature 1.0, top_p 0.95, and a 200K context window for the SWE-bench series.
- SWE-bench's official repo says the benchmark moved to a fully containerized Docker harness for reproducible evaluations.
- So anyone trying to match the number locally needs the same agent framework, tool loop, dataset split, and harness, not just the same checkpoint.
- The discussion is really about benchmark methodology, not whether the model is "good" or "bad."
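A local reproduction attempt would at minimum pin the decoding knobs the model card lists. A hedged sketch of such a pinned config follows; the config structure and helper are hypothetical conveniences, while the parameter values come from the model card as described above:

```python
# Pinned evaluation settings per the Qwen model card for SWE-bench.
# The dict layout and helper are illustrative, not an official API.
EVAL_CONFIG = {
    "dataset": "SWE-bench_Verified",
    "temperature": 1.0,             # fixed by the model card
    "top_p": 0.95,                  # fixed by the model card
    "max_context_tokens": 200_000,  # 200K context window
    "harness": "docker",            # SWE-bench's containerized harness
}

def sampling_kwargs(config: dict) -> dict:
    """Extract just the decoding parameters for the model call."""
    return {k: config[k] for k in ("temperature", "top_p")}
```

Pinning these values removes one source of drift, but as the bullets note, the agent scaffold and Docker harness still have to match before a local score is comparable.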
DISCOVERED: 4h ago (2026-04-24)
PUBLISHED: 7h ago (2026-04-24)
AUTHOR: Leflakk