Qwen3.6-35B-A3B tops 80 tok/sec in llama.cpp
A Reddit guide shows Qwen3.6-35B-A3B running through llama.cpp's MTP (multi-token prediction) support on a 12GB RTX 4070 Super and clearing 80 tok/sec in the author's benchmark. The trick is careful CPU/GPU balancing plus KV-cache quantization, which keeps both high throughput and 128K context within reach.
This is a strong local-inference result, but the real story is memory choreography rather than magic hardware. It shows how far sparse MoE plus speculative decoding can stretch a “too big for 12GB” model when the runtime is tuned hard.
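For a sense of the knobs involved, here is a minimal sketch of the kind of llama-server invocation the post describes. It is not the author's exact command: the model path and quant are placeholders, the `-ot` regex is one common way to pin MoE expert tensors to CPU, flag syntax varies across llama.cpp builds, and `-fitt 1664` is reproduced verbatim from the post.

```sh
# Sketch only -- placeholders throughout, not the author's exact command.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 131072 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -ot "ffn_.*_exps.=CPU" \
  -fitt 1664
# -c 131072       : the 128K context window the post targets
# -fa on          : flash attention, needed for a quantized V cache
# -ctk/-ctv q8_0  : KV-cache quantization, roughly halving cache size vs f16
# -ngl 99 + -ot   : offload all layers, then push the sparse MoE expert
#                   tensors back to system RAM so attention stays on the GPU
# -fitt 1664      : the VRAM-headroom flag quoted in the post
```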
- `-fitt 1664` is doing the heavy lifting by reserving enough VRAM for the draft model and KV cache while letting llama.cpp spill the rest intelligently.
- The posted 70-82 tok/s range is respectable, but the draft-token acceptance rate matters just as much as raw drafting speed.
- 128K context on 12GB is the more meaningful achievement here; many local setups are only fast while the prompt stays short (a rough cache-size estimate follows the list).
- This is not a universal 12GB recipe, especially if the GPU is also driving a display, so real-world headroom will vary.
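To see why 128K is the hard part, a back-of-envelope KV-cache estimate helps. The dimensions below are illustrative placeholders (the post does not state the model's layer or head counts); the point is the scaling, not the exact figure:

\[
\underbrace{2}_{K+V} \times \underbrace{48}_{\text{layers}} \times \underbrace{4}_{\text{kv heads}} \times \underbrace{128}_{d_{\text{head}}} \times \underbrace{131072}_{\text{ctx}} \times 2\,\text{B} \approx 12\,\text{GiB at f16}
\]

Quantizing the cache to q8_0 (about 8.5 bits per element) cuts that to roughly 6.4 GiB, which is the difference between impossible on a 12GB card and tight-but-feasible once the expert weights are spilled to system RAM.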
Discovered: 2026-05-09 (2h ago)
Published: 2026-05-09 (3h ago)
Author: janvitos
