OPEN_SOURCE ↗
REDDIT // 10h ago · INFRASTRUCTURE
Qwen3.5-35B-A3B tests 7900 XTX limits
A LocalLLaMA user is trying to run Qwen3.5-35B-A3B on an RX 7900 XTX with roughly 90K context for coding and tool use, but the quantization and KV-cache budget collide fast. The thread centers on the familiar local-inference tradeoff: keep a larger model, or keep enough context and speed to make it usable.
// ANALYSIS
This is the classic “model size versus usable context” problem, and on 24GB VRAM the cache budget usually wins.
- Qwen3.5-35B-A3B officially supports long context and tool use, but its own docs stress that extended context matters for reasoning and recommend 128K or more in many setups
- For a single 7900 XTX, Q4 or larger quants of a 35B MoE leave very little headroom for KV cache, which is exactly what a 90K coding workload needs
- The community answer in the thread is pragmatic: a 27B dense model is often the better fit if you want stable throughput, decent reasoning, and room for long prompts
- If you stick with 35B-A3B, the realistic fix is not "better quantization" so much as accepting a lower effective context target or using more aggressive serving tricks to reclaim memory
- For tool calling and coding specifically, a slightly smaller model that stays responsive will usually beat a larger one that constantly chokes on context pressure
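The collision the bullets describe can be made concrete with back-of-envelope VRAM math. A minimal sketch, assuming purely illustrative architecture numbers (48 layers, 8 grouped-query KV heads, head dim 128, ~4.5 bits/weight for a Q4-class quant) rather than Qwen3.5-35B-A3B's actual published config:

```python
# Rough VRAM budget for a long-context run on a 24 GiB card.
# Layer count, KV heads, head dim, and bits/weight are ASSUMED
# illustrative values, not this model's real specs.

GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V each store kv_heads * head_dim values per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def weight_bytes(params, bits_per_weight):
    return params * bits_per_weight / 8

ctx = 90_000
kv = kv_cache_bytes(ctx, layers=48, kv_heads=8, head_dim=128,
                    bytes_per_elem=2)          # fp16 KV cache
w = weight_bytes(35e9, 4.5)                    # ~Q4 quant incl. overhead

print(f"KV cache @ 90K, fp16: {kv / GIB:5.1f} GiB")
print(f"Weights, 35B @ ~Q4:   {w / GIB:5.1f} GiB")
print(f"Total:                {(kv + w) / GIB:5.1f} GiB vs 24 GiB VRAM")
```

Under these assumptions the cache alone runs to roughly 16 GiB and the quantized weights to roughly 18 GiB, so the 24 GiB budget is blown well before activations and runtime overhead. Quantizing the KV cache to 8-bit halves the cache term but still leaves the total over budget, which is why the thread converges on a smaller model or a smaller context target.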
// TAGS
qwen3-5-35b-a3b · llm · ai-coding · agent · inference · gpu · self-hosted
DISCOVERED
10h ago
2026-04-17
PUBLISHED
11h ago
2026-04-17
RELEVANCE
8/10
AUTHOR
not_NEK0