Qwen3.6-35B-A3B Slows Under Long Context
A Reddit user benchmarks Qwen3.6-35B-A3B in LM Studio on a Windows machine with a 16GB GPU and 32GB of system RAM, and finds that performance drops sharply under long-context load. With the context window around 72% full, a single turn takes 77 seconds end-to-end and generates at about 9 tokens/sec, which is too slow for an iterative coding-agent workflow. The post asks whether llama.cpp/LM Studio tuning can meaningfully improve throughput and, if not, which smaller local coding models are still useful on this hardware.
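To see why a 77-second turn feels so slow in an agent loop, a quick back-of-envelope split helps. This sketch assumes the reported 9 tok/s is the generation-only rate and picks a hypothetical output length (the post gives neither); both numbers are illustrative, not from the source.

```python
# Back-of-envelope split of the reported 77 s end-to-end latency.
# Assumptions (not from the post): 9 tok/s is generation-only,
# and the output length is a guess.
total_s = 77.0        # reported end-to-end latency
gen_rate = 9.0        # reported tokens/sec during generation
output_tokens = 400   # hypothetical reply length for one agent step

gen_s = output_tokens / gen_rate        # time spent generating
prompt_s = max(total_s - gen_s, 0.0)    # remainder ~ prompt processing
print(f"generation ~ {gen_s:.0f}s, prompt processing ~ {prompt_s:.0f}s")
# -> generation ~ 44s, prompt processing ~ 33s
```

Under these assumptions, tens of seconds per turn go to re-reading the near-full context before the first output token, and that cost recurs on every agent step.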
Hot take: this is less an LM Studio problem than a "35B MoE + long context + consumer VRAM" problem. The model is impressive on paper, but for agentic coding, the latency tax from prompt processing and KV-cache pressure dominates quickly.
- Qwen3.6-35B-A3B is a 35B-total, 3B-active MoE model with native long context, so once you push it toward agent-style prompts, memory and bandwidth costs rise fast (a rough KV-cache estimate appears in the sketch after this list).
- A 4-bit KV cache helps fit more context, but it is not a magic speed switch; the big win is usually avoiding huge contexts in the first place.
- Switching from LM Studio to raw llama.cpp may shave a little overhead, but it will not change the core bottleneck unless your current settings are forcing unnecessary spillover to system RAM or suboptimal GPU offload.
- For this hardware, the practical local coding range is usually 7B-to-14B-class models, not a 35B model at heavy context.
- Better bets are Qwen2.5-Coder-7B-Instruct for the fastest interactive loop, Qwen2.5-Coder-14B-Instruct if you can tolerate more latency, and DeepSeek-Coder-V2-Lite-Instruct if you want a MoE-style coding model with lighter active compute and 128K context.
- If the goal is a usable agent rather than the largest possible model, trimming system prompts, lowering context, and choosing a smaller coder model will matter more than frontend changes.
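To make the KV-cache pressure concrete, here is a minimal size estimator. The layer, head, and dimension numbers are placeholders, not the real Qwen3.6-35B-A3B attention config (which the post does not give); substitute the values from the model's config.json. The 4-bit figure also ignores quantization block overhead, so treat it as a lower bound.

```python
# Rough KV-cache size estimator. All architecture numbers below are
# PLACEHOLDERS, not the actual Qwen3.6-35B-A3B config.
def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # 2x for the separate K and V tensors stored per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_bytes(ctx) / 2**30                    # 16-bit cache
    q4 = kv_cache_bytes(ctx, bytes_per_elem=0.5) / 2**30  # ~4-bit cache
    print(f"{ctx:>7} tokens: fp16 ~ {fp16:.1f} GiB, q4 ~ {q4:.1f} GiB")
```

With these placeholder numbers, a 131K-token cache runs to roughly 24 GiB at fp16 and 6 GiB at 4-bit, on top of the quantized weights themselves. That is why a near-full long context overwhelms a 16GB card regardless of frontend, and why 4-bit caching buys room but not speed.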
DISCOVERED: 2026-05-10
PUBLISHED: 2026-05-10
AUTHOR: CodProfessional3712