Qwen3.5-35B-A3B tests 7900 XTX limits
OPEN_SOURCE
REDDIT // 10h ago · INFRASTRUCTURE


A LocalLLaMA user is trying to run Qwen3.5-35B-A3B on an RX 7900 XTX with roughly 90K context for coding and tool use, but the weight quantization and the KV-cache budget collide fast. The thread centers on the familiar local-inference tradeoff: keep the larger model, or keep enough context and speed to make it usable.

// ANALYSIS

This is the classic “model size versus usable context” problem, and on 24GB VRAM the cache budget usually wins.

  • Qwen3.5-35B-A3B officially supports long context and tool use, and its own docs stress that extended context matters for reasoning workloads, recommending budgets of 128K or more in many setups
  • For a single 7900 XTX, Q4 or higher on a 35B MoE leaves very little headroom for KV cache, which is exactly what a 90K coding workload needs
  • The community answer in the thread is pragmatic: a 27B dense model is often the better fit if you want stable throughput, decent reasoning, and room for long prompts
  • If you stick with 35B-A3B, the realistic fix is not “better quantization” so much as accepting a lower effective context target or using more aggressive serving tricks to reclaim memory
  • For tool calling and coding specifically, a slightly smaller model that stays responsive will usually beat a larger one that constantly chokes on context pressure
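The collision the bullets describe is simple arithmetic. A minimal sketch below, assuming illustrative architecture numbers (48 layers, 4 KV heads, head dim 128, ~4.5 bits per weight for a Q4-class quant) rather than published Qwen3.5-35B-A3B specs:

```python
# Back-of-envelope VRAM budget: ~35B MoE weights at Q4 plus a 90K-token KV cache.
# Layer count, KV-head count, and head dim are ASSUMED values for illustration,
# not the model's published architecture.

def kv_cache_gib(ctx_tokens, n_layers=48, n_kv_heads=4, head_dim=128,
                 bytes_per_elem=2):  # 2 bytes/elem = fp16 K/V cache
    # K and V are each cached per layer, per KV head, per head dim, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token / 2**30

def weights_gib(n_params=35e9, bits_per_weight=4.5):  # ~Q4_K_M average bpw
    return n_params * bits_per_weight / 8 / 2**30

w = weights_gib()            # ≈ 18.3 GiB of weights
kv = kv_cache_gib(90_000)    # ≈ 8.2 GiB of fp16 KV cache
print(f"weights {w:.1f} GiB + KV {kv:.1f} GiB = {w + kv:.1f} GiB (card: 24 GiB)")

# One of the "serving tricks" from the thread: an 8-bit KV cache
# (1 byte/elem) roughly halves the cache footprint.
kv_q8 = kv_cache_gib(90_000, bytes_per_elem=1)
print(f"with 8-bit KV cache: {w + kv_q8:.1f} GiB")
```

Under these assumptions the fp16-cache total lands several GiB over the 7900 XTX's 24 GiB, and even the 8-bit-cache total leaves almost no headroom for activations and compute buffers, which is why the thread steers toward a smaller model or a lower context target. In llama.cpp, the 8-bit cache corresponds to flags like `--cache-type-k q8_0 --cache-type-v q8_0`.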
// TAGS
qwen3-5-35b-a3b · llm · ai-coding · agent · inference · gpu · self-hosted

DISCOVERED

10h ago

2026-04-17

PUBLISHED

11h ago

2026-04-17

RELEVANCE

8 / 10

AUTHOR

not_NEK0