OPEN_SOURCE ↗
REDDIT // 10h ago · INFRASTRUCTURE
Qwen3.5-35B-A3B tests 7900 XTX limits
A LocalLLaMA user is trying to run Qwen3.5-35B-A3B on an RX 7900 XTX with roughly 90K context for coding and tool use, but the quantization and KV-cache budget collide fast. The thread centers on the familiar local-inference tradeoff: keep a larger model, or keep enough context and speed to make it usable.
// ANALYSIS
This is the classic “model size versus usable context” problem, and on 24GB VRAM the cache budget usually wins.
- Qwen3.5-35B-A3B officially supports long context and tool use, but its own docs stress that extended context matters for reasoning and recommend 128K or more in many setups
- For a single 7900 XTX, Q4 or larger quants of a 35B MoE leave very little headroom for KV cache, which is exactly what a 90K coding workload needs
- The community answer in the thread is pragmatic: a 27B dense model is often the better fit if you want stable throughput, decent reasoning, and room for long prompts
- If you stick with 35B-A3B, the realistic fix is not "better quantization" so much as accepting a lower effective context target or using more aggressive serving tricks to reclaim memory
- For tool calling and coding specifically, a slightly smaller model that stays responsive will usually beat a larger one that constantly chokes on context pressure
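The collision the bullets describe can be made concrete with back-of-envelope VRAM math. A minimal sketch, assuming purely illustrative architecture numbers (48 layers, 8 grouped-query KV heads, head dim 128, ~4.5 bits/weight for a Q4-class quant) rather than Qwen3.5-35B-A3B's actual published config:

```python
# Rough VRAM budget for a long-context run on a 24 GiB card.
# Layer count, KV heads, head dim, and bits/weight are ASSUMED
# illustrative values, not this model's real specs.

GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V each store kv_heads * head_dim values per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def weight_bytes(params, bits_per_weight):
    return params * bits_per_weight / 8

ctx = 90_000
kv = kv_cache_bytes(ctx, layers=48, kv_heads=8, head_dim=128,
                    bytes_per_elem=2)          # fp16 KV cache
w = weight_bytes(35e9, 4.5)                    # ~Q4 quant incl. overhead

print(f"KV cache @ 90K, fp16: {kv / GIB:5.1f} GiB")
print(f"Weights, 35B @ ~Q4:   {w / GIB:5.1f} GiB")
print(f"Total:                {(kv + w) / GIB:5.1f} GiB vs 24 GiB VRAM")
```

Under these assumptions the cache alone runs to roughly 16 GiB and the quantized weights to roughly 18 GiB, so the 24 GiB budget is blown well before activations and runtime overhead. Quantizing the KV cache to 8-bit halves the cache term but still leaves the total over budget, which is why the thread converges on a smaller model or a smaller context target.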
// TAGS
qwen3-5-35b-a3b · llm · ai-coding · agent · inference · gpu · self-hosted
DISCOVERED
10h ago
2026-04-17
PUBLISHED
11h ago
2026-04-17
RELEVANCE
8/10
AUTHOR
not_NEK0