Qwopus tops local flight sim benchmark
A community benchmark gave nine local models, all 8-bit MLX quantizations, the same prompt: build a single-file browser flight-combat game. Qwopus 3.5 27B ranked first, and the results suggested that quantization provider, task completion, and emergent implementation choices mattered more than raw parameter count.
This is not a rigorous leaderboard, but it is a useful artifact test because it measures whether models can ship something playable, not just pass canned coding evals.
- Qwopus winning with fewer prompts and real flight physics is a strong signal for distillation quality on messy creative coding tasks.
- The three Qwen3.6 35B quant variants behaving differently makes quantization source a practical evaluation dimension, not just a packaging detail.
- Qwen Coder Next 80B producing the most code but one of the weakest games is a reminder that bigger code output can hide worse product judgment.
- The wildcard plane choice worked as a lightweight creativity probe, exposing differences that standard benchmark prompts usually miss.
DISCOVERED
2026-04-21
PUBLISHED
2026-04-21
AUTHOR
StudentDifficult8240
