GBNF tweak slashes Qwen3.6 token churn
A LocalLLaMA user reports that a custom GBNF grammar dramatically reduces reasoning-token churn and wall time for Qwen3.6-35B-A3B and Qwen3.6-27B in llama.cpp. The biggest gains show up on the 35B-A3B model, where puzzle latency and benchmark throughput improve sharply on an RTX 5090 setup.
The result is a useful reminder that output constraints can matter as much as model choice for verbose reasoning workloads. The benchmark is also highly setup-specific, so the gains are interesting but should be reproduced before anyone treats them as general.
- The claimed win is largest on Qwen3.6-35B-A3B: puzzle time drops from 2m32s to 12s, and bench time from 33m52s to 11m04s.
- Qwen3.6-27B keeps the same bench score while improving throughput and finishing time, which suggests the grammar is trimming wasted reasoning rather than breaking task quality.
- The post’s core insight is practical: shorter reasoning traces can remove prefill churn on long-horizon coding work, which is exactly where local inference feels most expensive.
- The benchmark is still anecdotal: custom quantizations, a bespoke Rust/Next.js task suite, and one RTX 5090 machine make this a strong lead, not a universal conclusion.
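To put the reported 35B-A3B timings in proportion, the short Python sketch below converts the quoted durations to seconds and derives the speedup and the share of wall time saved. Only the four durations come from the post; the helper names `to_seconds` and `speedup` are hypothetical and exist purely for illustration.

```python
def to_seconds(text: str) -> int:
    """Convert a duration such as '2m32s' or '12s' into whole seconds."""
    minutes, _, rest = text.rpartition("m")   # no 'm' present -> minutes == ''
    seconds = rest.rstrip("s")
    return int(minutes or 0) * 60 + int(seconds or 0)


def speedup(before: str, after: str) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return to_seconds(before) / to_seconds(after)


# Timings as quoted in the summary for Qwen3.6-35B-A3B (without vs. with the grammar).
reported = {
    "puzzle": ("2m32s", "12s"),
    "bench": ("33m52s", "11m04s"),
}

for task, (before, after) in reported.items():
    factor = speedup(before, after)
    saved = 1 - to_seconds(after) / to_seconds(before)
    print(f"{task}: {factor:.1f}x faster, {saved:.1%} less wall time")
```

On these figures the puzzle run comes out at roughly 12.7x faster and the bench run at roughly 3.1x faster, which shows how much the headline number depends on which of the two tasks is quoted.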
DISCOVERED
45d ago
2026-04-27
PUBLISHED
45d ago
2026-04-27
AUTHOR
Holiday_Purpose_3166