Qwen3.5-35B-A3B tests 7900 XTX limits

// 45d ago · INFRASTRUCTURE

A LocalLLaMA user is trying to run Qwen3.5-35B-A3B on an RX 7900 XTX with roughly 90K tokens of context for coding and tool use, but the quantized weights and the KV cache compete for the same VRAM and the budget runs out fast. The thread centers on the familiar local-inference tradeoff: keep a larger model, or keep enough context and speed to make it usable.

// ANALYSIS

This is the classic “model size versus usable context” problem, and on 24GB of VRAM the KV-cache budget is usually the binding constraint.

  • Qwen3.5-35B-A3B officially supports long context and tool use, but its own docs warn that extended context matters for reasoning and recommend at least 128K in many setups
  • For a single 7900 XTX, Q4 or higher on a 35B MoE leaves very little headroom for KV cache, which is exactly what a 90K coding workload needs
  • The community answer in the thread is pragmatic: a 27B dense model is often the better fit if you want stable throughput, decent reasoning, and room for long prompts
  • If you stick with 35B-A3B, the realistic fix is not “better quantization” so much as accepting a lower effective context target or using more aggressive serving tricks to reclaim memory
  • For tool calling and coding specifically, a slightly smaller model that stays responsive will usually beat a larger one that constantly chokes on context pressure
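
The weights-versus-cache squeeze described above can be put into rough numbers. The sketch below uses the standard KV-cache size formula for grouped-query attention (one K and one V tensor per layer) and a simple bits-per-weight estimate for the quantized model. The layer count, KV-head count, head dimension, 4.5 bits per weight, and 1.5 GB runtime overhead are illustrative placeholders, not Qwen3.5-35B-A3B's published configuration; a model with hybrid or linear-attention layers would cache less than this formula suggests.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2.0):
    """KV-cache size for standard attention: K and V, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of quantized weights, ignoring per-block metadata."""
    return n_params * bits_per_weight / 8


def max_context(vram_bytes, n_params, bits_per_weight,
                n_layers, n_kv_heads, head_dim,
                bytes_per_elem=2.0, overhead_bytes=1.5e9):
    """Largest context whose KV cache fits in the VRAM left after weights and overhead."""
    free = vram_bytes - weight_bytes(n_params, bits_per_weight) - overhead_bytes
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return max(0, int(free // per_token))


if __name__ == "__main__":
    VRAM = 24 * 1024**3  # 24 GiB card
    # Hypothetical architecture values, chosen only to illustrate the arithmetic.
    arch = dict(n_layers=40, n_kv_heads=4, head_dim=128)

    cache_90k = kv_cache_bytes(context_tokens=90_000, **arch)
    print(f"weights (35e9 params @ 4.5 bpw): {weight_bytes(35e9, 4.5) / 1e9:.1f} GB")
    print(f"KV cache @ 90K tokens, 16-bit:   {cache_90k / 1e9:.1f} GB")
    print(f"max context, 16-bit cache: {max_context(VRAM, 35e9, 4.5, **arch)}")
    print(f"max context, 8-bit cache:  {max_context(VRAM, 35e9, 4.5, bytes_per_elem=1.0, **arch)}")
```

Under these assumed numbers, the weights alone take most of the card, and halving the bytes per cache element (one of the memory-reclaiming tricks the last bullets allude to) roughly doubles the context that fits. Swapping in a model's real layer and head counts changes the totals but not the shape of the tradeoff.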
// TAGS
qwen3-5-35b-a3b, llm, ai-coding, agent, inference, gpu, self-hosted

DISCOVERED

45d ago

2026-04-17

PUBLISHED

45d ago

2026-04-17

RELEVANCE

8/10

AUTHOR

not_NEK0