Qwen3.6-35B-A3B Slows Under Long Context
A Reddit user benchmarks Qwen3.6-35B-A3B in LM Studio on a Windows machine with a 16GB GPU and 32GB of system RAM, and finds that performance drops sharply under long-context load. With the context window around 72% full, a single turn takes 77 seconds end-to-end and generates at about 9 tokens/sec, which is too slow for an iterative coding-agent workflow. The post asks whether llama.cpp/LM Studio tuning can meaningfully improve throughput and, if not, which smaller local coding models are still useful on this hardware.
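To see why a 77-second turn feels so slow in an agent loop, a quick back-of-envelope split helps. This sketch assumes the reported 9 tok/s is the generation-only rate and picks a hypothetical output length (the post gives neither); both numbers are illustrative, not from the source.

```python
# Back-of-envelope split of the reported 77 s end-to-end latency.
# Assumptions (not from the post): 9 tok/s is generation-only,
# and the output length is a guess.
total_s = 77.0        # reported end-to-end latency
gen_rate = 9.0        # reported tokens/sec during generation
output_tokens = 400   # hypothetical reply length for one agent step

gen_s = output_tokens / gen_rate        # time spent generating
prompt_s = max(total_s - gen_s, 0.0)    # remainder ~ prompt processing
print(f"generation ~ {gen_s:.0f}s, prompt processing ~ {prompt_s:.0f}s")
# -> generation ~ 44s, prompt processing ~ 33s
```

Under these assumptions, tens of seconds per turn go to re-reading the near-full context before the first output token, and that cost recurs on every agent step.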
Hot take: this is less an LM Studio problem than a "35B MoE + long context + consumer VRAM" problem. The model is impressive on paper, but for agentic coding, the latency tax from prompt processing and KV-cache pressure dominates quickly.
- Qwen3.6-35B-A3B is a 35B-total, 3B-active MoE model with native long context, so once you push it toward agent-style prompts, memory and bandwidth costs rise fast (a rough KV-cache estimate appears in the sketch after this list).
- A 4-bit KV cache helps fit more context, but it is not a magic speed switch; the big win is usually avoiding huge contexts in the first place.
- Switching from LM Studio to raw llama.cpp may shave a little overhead, but it will not change the core bottleneck unless your current settings are forcing unnecessary spillover to system RAM or suboptimal GPU offload.
- For this hardware, the practical local coding range is usually 7B-to-14B-class models, not a 35B model at heavy context.
- Better bets are Qwen2.5-Coder-7B-Instruct for the fastest interactive loop, Qwen2.5-Coder-14B-Instruct if you can tolerate more latency, and DeepSeek-Coder-V2-Lite-Instruct if you want a MoE-style coding model with lighter active compute and 128K context.
- If the goal is a usable agent rather than the largest possible model, trimming system prompts, lowering context, and choosing a smaller coder model will matter more than frontend changes.
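To make the KV-cache pressure concrete, here is a minimal size estimator. The layer, head, and dimension numbers are placeholders, not the real Qwen3.6-35B-A3B attention config (which the post does not give); substitute the values from the model's config.json. The 4-bit figure also ignores quantization block overhead, so treat it as a lower bound.

```python
# Rough KV-cache size estimator. All architecture numbers below are
# PLACEHOLDERS, not the actual Qwen3.6-35B-A3B config.
def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # 2x for the separate K and V tensors stored per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_bytes(ctx) / 2**30                    # 16-bit cache
    q4 = kv_cache_bytes(ctx, bytes_per_elem=0.5) / 2**30  # ~4-bit cache
    print(f"{ctx:>7} tokens: fp16 ~ {fp16:.1f} GiB, q4 ~ {q4:.1f} GiB")
```

With these placeholder numbers, a 131K-token cache runs to roughly 24 GiB at fp16 and 6 GiB at 4-bit, on top of the quantized weights themselves. That is why a near-full long context overwhelms a 16GB card regardless of frontend, and why 4-bit caching buys room but not speed.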
DISCOVERED: 2026-05-10
PUBLISHED: 2026-05-10
AUTHOR: CodProfessional3712