OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
Qwen3.6-27B regains 16GB fit, 110k context
By reverting a llama.cpp quantization change, the author trims the Qwen3.6-27B IQ4_XS build back to 14.7GB and keeps it practical on 16GB VRAM. The custom GGUF's perplexity benchmarks nearly match stock at both 65k and 110k context, so this reads like a real deployment fix rather than a quality tradeoff.
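As a quick sanity check on the fit claim, here is a minimal sketch of the headroom arithmetic, using the sizes reported in the post and assuming a 16GB card with nothing else modeled (no KV cache, compute buffers, or display overhead):

```python
# Headroom left on a 16GB card after loading each GGUF build.
# Sizes are from the post; KV cache, compute buffers, and the
# desktop's own VRAM use are deliberately not modeled.
VRAM_GB = 16.0
builds = {"stock IQ4_XS": 15.1, "reverted IQ4_XS": 14.7}

for name, weights_gb in builds.items():
    headroom = VRAM_GB - weights_gb
    print(f"{name}: {weights_gb:.1f}GB weights, {headroom:.1f}GB headroom")
```

That is 0.9GB versus 1.3GB of headroom, and the fits/doesn't-fit line presumably sits in between once KV cache and compute buffers are added on top of the weights.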
// ANALYSIS
This is a niche patch with outsized impact for local model runners: a 0.4GB packaging change is the difference between “fits comfortably” and “doesn’t fit” on consumer 16GB cards.
- The stock IQ4_XS build is 15.1GB, while the reverted variant lands at 14.7GB, which is enough to preserve the 16GB VRAM use case.
- Perplexity deltas are tiny at both 65k and 110k context, which supports the claim that the attn_qkv rollback restores size without meaningfully hurting quality.
- The KV cache tests suggest Qwen3.6-27B does not benefit much from asymmetric K-heavy tuning, so the V cache matters more than the turboquant_plus guidance would imply (see the sketch after this list).
- The Q3 comparison weakens the "just drop to Q3" argument for coding workflows, since the smaller quant still gives up some quality at long context.
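To make the symmetric-versus-asymmetric cache question concrete, here is a minimal sketch comparing KV-cache footprints for a few K/V type pairings. The q4_0 and q8_0 bit widths match llama.cpp's block formats; the layer count and per-token KV width are placeholders, since the post doesn't state Qwen3.6-27B's internals:

```python
# KV-cache footprint under different K/V cache quantizations.
# Bits per element: f16 = 16; q8_0 = 8.5 (34-byte blocks of 32);
# q4_0 = 4.5 (18-byte blocks of 32). These match llama.cpp's formats.
BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def cache_gib(ctx: int, k_type: str, v_type: str,
              n_layers: int = 48, kv_dim: int = 1024) -> float:
    """Cache holds n_layers * kv_dim elements per token for K, and again
    for V. n_layers and kv_dim are placeholders, not Qwen3.6-27B specs."""
    per_token_bits = n_layers * kv_dim * (BITS[k_type] + BITS[v_type])
    return ctx * per_token_bits / 8 / 1024**3

for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q8_0")]:
    print(f"K={k:5} V={v:5}: {cache_gib(110_000, k, v):.2f} GiB at 110k context")
```

The two asymmetric pairings cost identical memory, so choosing K-heavy versus V-heavy precision is purely a quality call; that is what makes the post's finding that the V cache matters more directly actionable.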
// TAGS
qwen3.6-27b · llm · benchmark · open-source · inference
DISCOVERED
3h ago · 2026-04-28
PUBLISHED
7h ago · 2026-04-28
RELEVANCE
9/10
AUTHOR
Pablo_the_brave