OPEN_SOURCE
REDDIT · 5h ago · BENCHMARK RESULT
GPT-OSS-20B beats Qwen3.6 in coding
A Reddit user compares GPT-OSS-20B and Qwen3.6-35B-A3B on TypeScript and Rust prompts and says Claude Sonnet 4.6 rated the OpenAI model higher. The thread asks whether that reflects real quality, prompt sensitivity, or judge bias.
// ANALYSIS
This reads less like a definitive model ranking and more like a noisy local eval where output style, sampling, and the judge model all matter. GPT-OSS-20B is not ancient either: OpenAI introduced it in August 2025 and says it was trained with a coding- and STEM-heavy focus.
- OpenAI positions gpt-oss-20b as a local-friendly open-weight model optimized for reasoning, tool use, and coding, with 3.6B active parameters and a 16 GB memory target.
- Qwen3.6-35B-A3B is also a sparse MoE model aimed at agentic coding, so any gap says more about tuning, prompting, and inference settings than raw parameter count.
- LLM judges tend to reward clean structure, obvious type safety, and compile-looking code; that can favor one model’s writing style over another’s true correctness.
- Repeated trials combined with “pick the best score” selection make the comparison shakier, because they amplify variance instead of measuring central tendency.
- The useful lesson is that code evals should include compilation, runtime tests, and many runs; single-judge subjective ratings are a weak proxy for actual coding ability.
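The last bullet’s advice can be sketched as a tiny objective harness: gate each sampled completion on compilation, run a behavioral test, and report the mean pass rate across all runs rather than a judge’s rating or the best single run. Everything here (the `pass_rate` helper, the toy `add` task) is a hypothetical illustration, not the thread’s actual setup.

```python
from typing import Callable

def pass_rate(samples: list[str], tests: Callable[[str], bool]) -> float:
    """Fraction of sampled completions that both compile and pass tests.

    Averaging over all samples measures central tendency; taking the max
    of repeated trials would amplify variance instead.
    """
    results = []
    for src in samples:
        try:
            compile(src, "<candidate>", "exec")  # syntax gate, no execution yet
        except SyntaxError:
            results.append(False)
            continue
        results.append(tests(src))
    return sum(results) / len(results)

def run_tests(src: str) -> bool:
    """Behavioral check on the candidate: does add(2, 3) return 5?"""
    ns: dict = {}
    try:
        exec(src, ns)
        return ns["add"](2, 3) == 5  # objective test, not a style rating
    except Exception:
        return False

# Three sampled completions for the toy task:
samples = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # compiles, wrong behavior
    "def add(a, b) return a + b",        # syntax error
]
score = pass_rate(samples, run_tests)  # mean over runs, not max
```

A judge model scoring these on readability might rank the second sample highly; the harness only credits the one that actually computes the right answer.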
// TAGS
gpt-oss-20b · qwen · llm · ai-coding · reasoning · benchmark · open-source
DISCOVERED
2026-04-19 (5h ago)
PUBLISHED
2026-04-19 (6h ago)
RELEVANCE
9/10
AUTHOR
kaisellgren