Qwen3.6-27B codes well, crawls past 52K tokens
This field report tests Qwen3.6-27B, running the 4-bit IQ4_XS Unsloth quant through llama-server on an M2 MacBook Pro with 32GB of RAM. The author says the model produced excellent code in OpenCode, but throughput collapsed as the context grew: from 80 tok/s prompt processing and 7.9 tok/s generation early on, down to just 4 tok/s prompt processing and 3.1 tok/s generation at roughly 52,000 tokens. No swapping was observed, so the slowdown appears to be memory-bandwidth bound rather than a RAM-capacity problem. Speculative ngram-mod decoding did not seem to help materially.
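For context, here is a rough way to reproduce the shape of that measurement. This is a hedged sketch, not the author's harness: it assumes llama-server is listening on its default port 8080 with the OpenAI-compatible API, and it pads the prompt with filler text rather than real agentic-coding context.

```python
# Rough probe of how llama-server throughput changes with context length.
# Assumptions (not from the report): server at localhost:8080 with the
# OpenAI-compatible API, and plain filler text as a stand-in for real
# agentic-coding context.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port

def probe(context_tokens: int, gen_tokens: int = 128) -> None:
    # Pad the prompt to roughly the target context size. "lorem ipsum " is
    # about 3 tokens per repetition; real tokenization varies by content.
    filler = "lorem ipsum " * (context_tokens // 3)
    t0 = time.time()
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": filler + "\nSummarize the above."}],
        "max_tokens": gen_tokens,
    }, timeout=3600)
    elapsed = time.time() - t0
    usage = r.json()["usage"]
    # Wall-clock rate conflates prompt processing and generation; it only
    # shows the trend as context grows, not exact per-phase numbers.
    print(f"{usage['prompt_tokens']:>6} prompt tok | "
          f"{usage['completion_tokens'] / elapsed:5.1f} tok/s overall")

for n in (1_000, 8_000, 32_000, 52_000):
    probe(n)
```

Because the wall-clock rate folds prompt processing into the denominator, it understates generation speed at long contexts, but the downward trend the author describes should still be visible.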
Hot take: this reads like a reminder that a dense 27B model can be genuinely useful locally, but the hardware ceiling is real and ugly.
- The model quality appears high enough to justify the pain: the generated code was described as excellent even with no extra steering after the initial prompt.
- The bottleneck is likely bandwidth, not memory pressure, which makes this a hardware-fit story more than a software-tuning story; a back-of-envelope check follows this list.
- The ngram-mod speculative setup seems to have added complexity without clear payoff in this workload; a sketch of the underlying idea also follows the list.
- The author’s “slow but effective self-hosted Sonnet” framing is plausible for dense 27B-class models.
- This is a practical report, not a synthetic benchmark: the value is in the real agentic coding behavior under long context.
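The bandwidth framing in the second bullet can be sanity-checked with arithmetic: on a dense model, every generated token streams the full weight set plus the entire KV cache through memory. The numbers below are assumptions for illustration (IQ4_XS at roughly 4.25 bits/weight, ~200 GB/s for an M2 Pro, and a hypothetical GQA layout), not the published Qwen3.6-27B config.

```python
# Back-of-envelope check on the bandwidth-bound hypothesis. Each generated
# token must read the full weight set plus the whole KV cache, so tok/s is
# capped near bandwidth / bytes_read_per_token. All architecture numbers
# below are illustrative assumptions, not the real Qwen3.6-27B config.
PARAMS = 27e9
BITS_PER_WEIGHT = 4.25                    # approximate for IQ4_XS
BANDWIDTH = 200e9                         # bytes/s, assuming an M2 Pro
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128   # hypothetical GQA layout
KV_BYTES = 2                              # f16 cache entries

weights = PARAMS * BITS_PER_WEIGHT / 8
# K and V caches, per token of context, across all layers
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES

for ctx in (1_000, 52_000):
    per_token_read = weights + ctx * kv_per_token
    print(f"ctx={ctx:>6}: ~{BANDWIDTH / per_token_read:4.1f} tok/s ceiling")
```

These ceilings are optimistic, since real decoders rarely sustain peak bandwidth, but the point is the ratio: the long-context ceiling falls well below the short-context one with no swapping involved, which matches the reported drop from 7.9 to 3.1 tok/s.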
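On the third bullet, for readers unfamiliar with the technique: ngram speculation drafts tokens by reusing text already present in the context, then has the model verify the draft in one batched pass. The sketch below is a conceptual illustration of that lookup step, with assumed parameters, not llama.cpp's actual ngram-mod implementation.

```python
# Minimal sketch of the idea behind ngram speculative ("prompt lookup")
# decoding: reuse token runs already in the context as free draft tokens.
# Conceptual illustration only, not llama.cpp's ngram-mod code.
def propose_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """If the last n tokens appeared earlier, propose what followed them."""
    if len(tokens) < n + 1:
        return []
    tail = tokens[-n:]
    # Scan history (excluding the tail itself) for the most recent match.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []

# Toy demo: repetitive streams (boilerplate code, log output) match often;
# novel prose rarely does, which is one way speculation fails to pay off.
history = [1, 2, 3, 4, 5, 9, 9, 1, 2, 3]
print(propose_draft(history))   # -> [4, 5, 9, 9, 1, 2, 3]
```

The target model then verifies the drafted tokens in a single batched forward pass and keeps only the accepted prefix. When the acceptance rate is low, verification overhead can cancel the savings, which would be consistent with the author seeing no material speedup here.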
DISCOVERED: 2026-04-29
PUBLISHED: 2026-04-28
AUTHOR: boutell