OPEN_SOURCE
REDDIT · 6h ago · BENCHMARK RESULT
Qwen3.6-35B-A3B hits 30 t/s at 128K
On an RTX 5080 16GB, this local llama.cpp setup keeps Qwen3.6-35B-A3B usable at 128K context, reaching about 30 t/s with a hybrid KV-cache and expert-offload configuration. The post argues that for coding agents, context depth and memory placement matter more than raw empty-context (depth-0) speed or a denser quantization label.
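Why memory placement dominates at 128K can be seen with a back-of-envelope KV-cache sizing. The model dimensions below are illustrative assumptions for a GQA-style model, not Qwen3.6-35B-A3B's published config, and q8_0 is approximated as 1 byte per element:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

CTX = 128 * 1024                          # 128K context
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128   # assumed GQA shape

f16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2)   # f16: 2 bytes/elem
q8  = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 1)   # q8_0: ~1 byte/elem

print(f"f16  KV @128K: {f16 / 2**30:.1f} GiB")   # 24.0 GiB
print(f"q8_0 KV @128K: {q8 / 2**30:.1f} GiB")    # 12.0 GiB
```

Even under these assumed dimensions, the KV cache alone can exceed a 16GB card at full context in f16, which is why cache layout and what stays on the GPU matter more than peak short-context throughput.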
// ANALYSIS
The punchline is that long-context local coding is no longer a “maybe someday” story on a single consumer GPU, but a tuning problem with sharp cliffs. Once the author found the right MoE offload balance and KV layout, Qwen3.6-35B-A3B became fast enough to keep a Claude Code-style workflow practical.
- The key benchmark is the depth curve: the model stays strong as context grows, instead of collapsing the way the dense 27B setup did.
- The post makes a convincing case that file size and offload balance beat “better” quant labels if they let more experts stay on GPU.
- The eval lesson matters: self-written tests overstated quality differences until the author switched to one shared harness and deterministic sampling.
- The hybrid KV finding is the most interesting systems result here: compressing everything was slower than selectively promoting hot layers to q8_0 at long context.
- For local agent users, the real threshold is not peak tokens/sec at empty context; it is whether the model stays responsive at the conversation depth you actually work in.
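A launch in the spirit of this setup can be sketched with stock llama.cpp flags. The model filename and the expert-offload regex are assumptions, flag spellings drift between llama.cpp builds, and the post's selective per-layer "hot layer" KV promotion goes beyond what stock flags expose (stock builds apply one cache type uniformly):

```shell
./llama-server \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \  # assumed quant/file name
  -c 131072 \                       # 128K context
  -ngl 99 \                         # offload all repeating layers to GPU...
  -ot 'ffn_.*_exps' =CPU \          # ...but keep MoE expert tensors on CPU
  -fa \                             # flash attention (needed to quantize V cache)
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

The balance the post describes is found by tuning the `-ot` pattern and cache types together: every expert tensor kept off the GPU frees VRAM for KV cache, and the depth curve tells you when you have gone too far in either direction.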
// TAGS
qwen3-6-35b-a3b · llm · ai-coding · agent · inference · gpu · open-source
DISCOVERED
6h ago
2026-05-01
PUBLISHED
10h ago
2026-04-30
RELEVANCE
9/10
AUTHOR
craftogrammer