OPEN_SOURCE
REDDIT // 6h ago · BENCHMARK RESULT

Qwen3.6-35B-A3B hits 30 t/s at 128K

On an RTX 5080 16GB, this local llama.cpp setup keeps Qwen3.6-35B-A3B usable at 128K context, reaching about 30 t/s with a hybrid KV-cache and expert-offload configuration. The post argues that for coding agents, context depth and memory placement matter more than raw depth-0 (empty-context) speed or a nominally better quant label.
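
A minimal sketch of what such a configuration can look like with llama.cpp's llama-server, shown as a Python subprocess call. The GGUF filename, quant pick, and exact cache-type split are illustrative assumptions, not the author's settings; the flags themselves (-ngl, -ot / --override-tensor, --cache-type-k/-v) are standard llama-server options.

```python
# Sketch only, assuming a hypothetical quantized GGUF of the model.
import subprocess

args = [
    "llama-server",
    "-m", "qwen3.6-35b-a3b-q4_k_m.gguf",  # hypothetical filename
    "-c", "131072",                        # 128K context window
    "-ngl", "99",                          # put every layer on the GPU...
    # ...then override the MoE expert tensors back onto CPU RAM. Expert
    # FFN weights are most of the model's bytes, but only a few experts
    # fire per token, so they are the cheapest thing to evict from VRAM:
    "-ot", r"\.ffn_.*_exps\.=CPU",
    "--cache-type-k", "q8_0",              # keep keys at higher precision
    "--cache-type-v", "q5_1",              # compress values harder
    # a quantized V cache also requires flash attention to be enabled;
    # the exact flag spelling varies across llama.cpp builds
]
subprocess.run(args, check=True)
```

The A3B suffix is why CPU-resident experts are tolerable at all: only about 3B parameters are active per token, so per-token traffic to system RAM stays small even though the expert tensors hold most of the 35B.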

// ANALYSIS

The punchline is that long-context local coding is no longer a “maybe someday” story on a single consumer GPU, but a tuning problem with sharp cliffs. Once the author found the right MoE offload balance and KV layout, Qwen3.6-35B-A3B became fast enough to keep a Claude Code-style workflow practical.

  • The key benchmark is the depth curve: the model stays strong as context grows, instead of collapsing the way the dense 27B setup did.
  • The post makes a convincing case that a smaller file and the right offload balance beat a nominally “better” quant label when they let more expert tensors stay on the GPU.
  • The eval lesson matters: self-written tests overstated quality differences until the author switched to one shared harness and deterministic sampling.
  • The hybrid KV finding is the most interesting systems result here: compressing everything was slower than selectively promoting hot layers to q8_0 at long context.
  • For local agent users, the real threshold is not peak tokens/sec at empty context; it is whether the model stays responsive at the conversation depth you actually work in (a minimal sweep harness is sketched after this list).
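
A minimal sketch of that kind of depth sweep against a running llama-server, folding in the deterministic-sampling lesson from the eval bullet above. The filler-prompt trick, port, and token counts are assumptions for illustration, not the author's harness; the timings field is part of llama.cpp's /completion response.

```python
# Depth-sweep sketch: pin sampling (temperature 0, fixed seed) so runs
# are comparable, then read decode throughput at several context depths
# instead of only at an empty prompt.
import requests  # third-party: pip install requests

SERVER = "http://localhost:8080"  # default llama-server address

def decode_tps(depth_words: int, n_predict: int = 256) -> float:
    """Decode tokens/sec with roughly depth_words tokens of prompt filler."""
    # Crude filler: one short word is about one token, close enough for a sweep.
    prompt = "lorem " * depth_words + "\nSummarize the text above briefly."
    r = requests.post(
        f"{SERVER}/completion",
        json={
            "prompt": prompt,
            "n_predict": n_predict,
            "temperature": 0,  # deterministic sampling for comparable runs
            "seed": 42,
        },
        timeout=1800,
    )
    r.raise_for_status()
    # llama-server reports generation speed separately from prompt
    # processing, which is what the "stays responsive" question is about.
    return r.json()["timings"]["predicted_per_second"]

if __name__ == "__main__":
    for depth in (0, 8_192, 32_768, 65_536, 120_000):
        print(f"depth ~{depth:>7}: {decode_tps(depth):6.1f} t/s")
```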
// TAGS
qwen3-6-35b-a3b · llm · ai-coding · agent · inference · gpu · open-source

DISCOVERED

6h ago · 2026-05-01

PUBLISHED

10h ago · 2026-04-30

RELEVANCE

9/10

AUTHOR

craftogrammer