Qwen3.6-35B-A3B hits ~190K context on 8GB VRAM
This post shares a practical local-inference setup for running Qwen3.6-35B-A3B with roughly 190K context on an RTX 4060 8GB laptop with 32GB of DDR5 RAM, running as a Linux server reached over Tailscale. The author reports strong throughput on Q5 GGUF builds, with performance improving further after tuning `ctx-size`, `n-gpu-layers`, `n-cpu-moe`, and TurboQuant KV cache settings in a custom llama.cpp fork.
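For orientation, here is a minimal sketch of how those knobs map onto a stock `llama-server` invocation, driven from Python for readability. The GGUF filename, the layer/expert split, and the use of `--cache-type-k`/`--cache-type-v` are assumptions, not the author's published configuration; the post itself relies on the fork's TurboQuant cache rather than mainline flags.

```python
import subprocess

# Illustrative sketch only: the model filename, flag values, and offload split
# are hypothetical placeholders, not the author's published configuration.
cmd = [
    "./llama-server",
    "--model", "Qwen3.6-35B-A3B-Q5_K_M.gguf",  # placeholder path to a Q5 GGUF build
    "--ctx-size", "190000",                    # target roughly 190K tokens of context
    "--n-gpu-layers", "999",                   # offload as many layers as fit in 8GB VRAM
    "--n-cpu-moe", "24",                       # keep this many MoE expert blocks in DDR5 (tune per machine)
    "--cache-type-k", "q8_0",                  # stock KV-cache quantization flags; the post's
    "--cache-type-v", "q8_0",                  #   TurboQuant fork replaces this mechanism
]
subprocess.run(cmd, check=True)
```

Note that in mainline builds a quantized V cache also requires flash attention to be enabled, and the right `--n-cpu-moe` value depends on how much of the expert weights can sit in DDR5 without starving the GPU.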
Hot take: this is less a model announcement than a useful real-world stress test showing how far sparse MoE inference can be pushed on consumer hardware when the memory layout is tuned carefully.
- The main value is the configuration recipe, not just the raw benchmark numbers, because it shows what actually moved throughput at very large context.
- The post suggests Q5 materially outperforms Q4 for long-context reasoning on this model family, which is a useful signal for anyone optimizing quality vs. speed.
- TurboQuant KV cache appears to be the key enabler at ~190K context, making the setup far more viable than standard cache behavior; the sizing sketch after this list shows why the cache dominates at that length.
- The Linux + DDR5 emphasis is believable and practical: bandwidth, paging behavior, and mmap/mlock choices likely matter more than people expect.
- `n-cpu-moe` tuning is the most interesting knob here, but the author is still in the exploratory phase rather than presenting a universally optimal value.
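To see why the KV cache is the pivotal constraint at this scale, the sketch below estimates cache size from context length. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6-35B-A3B's actual architecture, and the byte widths are round numbers, so only the order of magnitude matters.

```python
# Back-of-envelope KV-cache sizing. All architecture numbers below are
# hypothetical placeholders, not the real Qwen3.6-35B-A3B configuration.
def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold ctx_tokens * n_kv_heads * head_dim elements per layer.
    elems = 2 * ctx_tokens * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / 2**30

CTX = 190_000
print(f"fp16 cache:   {kv_cache_gib(CTX, 48, 4, 128, 2.0):.1f} GiB")  # ~17 GiB, far beyond 8GB VRAM
print(f"~4-bit cache: {kv_cache_gib(CTX, 48, 4, 128, 0.5):.1f} GiB")  # ~4 GiB, fits alongside weights
```

Even with placeholder numbers, the gap makes the bullet's claim plausible: at ~190K tokens an unquantized cache alone would overflow the card's 8GB, so aggressive cache quantization is what makes the rest of the recipe workable.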
Discovered: 2026-05-10 (3h ago) · Published: 2026-05-10 (5h ago) · Author: Atul_Kumar_97