GPT-OSS-20B beats Qwen3.6 in coding
A Reddit user compares GPT-OSS-20B and Qwen3.6-35B-A3B on TypeScript and Rust prompts and says Claude Sonnet 4.6 rated the OpenAI model higher. The thread asks whether that reflects real quality, prompt sensitivity, or judge bias.
This reads less like a definitive model ranking and more like a noisy local eval where output style, sampling, and the judge model all matter. GPT-OSS-20B is not ancient either: OpenAI introduced it in August 2025 and says it was trained with a coding- and STEM-heavy focus.
- OpenAI positions gpt-oss-20b as a local-friendly open-weight model optimized for reasoning, tool use, and coding, with 3.6B active parameters and a 16 GB memory target.
- Qwen3.6-35B-A3B is also a sparse MoE model aimed at agentic coding, so the gap is more likely about tuning, prompting, and inference settings than raw parameter count.
- LLM judges tend to reward clean structure, obvious type safety, and compile-looking code; that can favor one model’s writing style over another model’s actual correctness.
- Repeated trials plus “pick the best score” selection make the comparison shakier: the maximum over several runs rewards variance and lucky samples instead of measuring central tendency.
- The useful lesson is that code evals should include compilation, runtime tests, and many runs; single-judge subjective ratings are a weak proxy for actual coding ability.
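The point about best-score selection can be illustrated with a small simulation. This is a minimal sketch, not the Reddit user's setup: it assumes judge scores behave like noisy draws around a fixed underlying quality on a 0 to 10 scale, and every name and number in it (`simulate_judge_scores`, `average_best_of`, `pass_rate`, the quality of 7.0, the noise of 1.5) is hypothetical.

```python
import random
import statistics


def simulate_judge_scores(true_quality, noise_sd, n_runs, rng):
    """Draw n_runs noisy scores around a hypothetical true quality, clamped to 0..10."""
    return [
        min(10.0, max(0.0, rng.gauss(true_quality, noise_sd)))
        for _ in range(n_runs)
    ]


def summarize(scores):
    """Report the best score next to central-tendency and spread statistics."""
    return {
        "best": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


def average_best_of(n_runs, trials, true_quality=7.0, noise_sd=1.5, seed=0):
    """Average the best-of-n score over many repeated experiments."""
    rng = random.Random(seed)
    bests = [
        max(simulate_judge_scores(true_quality, noise_sd, n_runs, rng))
        for _ in range(trials)
    ]
    return statistics.mean(bests)


def pass_rate(run_outcomes):
    """Fraction of runs that compiled and passed their tests (True = pass)."""
    return sum(1 for ok in run_outcomes if ok) / len(run_outcomes)


if __name__ == "__main__":
    # Same assumed quality, different numbers of attempts: the best-of-n
    # figure grows with n even though nothing about the model changed.
    print("avg best of 3: ", round(average_best_of(3, 2000), 2))
    print("avg best of 20:", round(average_best_of(20, 2000), 2))
    print(summarize([6.0, 7.0, 8.0]))
    print("pass rate:", pass_rate([True, False, True, True]))
```

Under these assumptions, a model that is sampled more often, or that simply has noisier outputs, tends to post a higher best score without being any better on average. Reporting the mean, median, and spread over all runs, together with an objective pass rate from compiling and testing each output, would give a steadier comparison than a single judge's top rating.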
DISCOVERED
2026-04-19
PUBLISHED
2026-04-19
AUTHOR
kaisellgren