OPEN_SOURCE
REDDIT // 4h ago // BENCHMARK RESULT
GBNF tweak slashes Qwen3.6 token churn
A LocalLLaMA user reports that a custom GBNF grammar dramatically reduces reasoning-token churn and wall time for Qwen3.6-35B-A3B and Qwen3.6-27B in llama.cpp. The biggest gains show up on the 35B-A3B model, where puzzle latency and benchmark throughput improve sharply on an RTX 5090 setup.
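The post does not reproduce the grammar itself, so the sketch below is only a guess at the general shape of such a constraint: a GBNF rule that hard-caps the model's <think> block and leaves the answer unconstrained. The <think> tag convention, the rule names, and the 4096-character cap are all assumptions, and the {m,n} repetition syntax requires a llama.cpp build recent enough to support GBNF repetition ranges.

    # Hypothetical sketch, not the grammar from the post.
    # Caps reasoning at 4096 characters, then allows a free-form answer.
    root      ::= reasoning answer
    reasoning ::= "<think>" [^<]{0,4096} "</think>" "\n"
    # Simplification: the answer is modeled as any text not containing "<".
    answer    ::= [^<]*

Tightening that cap is the kind of change that would turn runaway reasoning into the shorter traces the post reports.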
// ANALYSIS
This is a useful reminder that output constraints can matter as much as model choice for verbose reasoning workloads. It is also a very setup-specific benchmark, so the gains are worth attention but should be reproduced before anyone treats them as general.
- The claimed win is largest on Qwen3.6-35B-A3B: puzzle time drops from 2m32s to 12s, and bench time from 33m52s to 11m04s.
- Qwen3.6-27B keeps the same bench score while improving throughput and finishing time, which suggests the grammar trims wasted reasoning rather than degrading task quality.
- The post's core insight is practical: shorter reasoning traces cut prefill churn on long-horizon coding work, which is exactly where local inference feels most expensive (a reproduction sketch follows this list).
- The benchmark is still anecdotal: custom quantizations, a bespoke Rust/Next.js task suite, and a single RTX 5090 machine make this a strong lead, not a universal conclusion.
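For anyone attempting a reproduction, a grammar file of that kind is passed to llama.cpp via its standard --grammar-file option; the model file, grammar filename, and prompt below are placeholders, not the poster's actual setup.

    llama-cli -m qwen3.6-35b-a3b.gguf \
        --grammar-file think-cap.gbnf \
        -p "Solve the puzzle: ..."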
// TAGS
qwen3-6-35b-a3b · qwen3-6-27b · llama-cpp · benchmark · reasoning · ai-coding · llm
DISCOVERED
2026-04-27 (4h ago)
PUBLISHED
2026-04-27 (4h ago)
RELEVANCE
8/10
AUTHOR
Holiday_Purpose_3166