Qwen3.5 hits VRAM wall, parallelism stalls

// 68d ago INFRASTRUCTURE

An engineer running Qwen3.5-35B-A3B in Open WebUI + Ollama on a 32GB RTX 5090 wants enough headroom for two simultaneous chats without tanking technical accuracy. The decision is whether to save memory with KV-cache quantization, cheaper weights, or a smaller context window.

// ANALYSIS

I’d start with Flash Attention plus OLLAMA_KV_CACHE_TYPE=q8_0, keep the Q4 weights, and trim context only if two sessions still do not fit. For a V&V/RAMS assistant, preserving base-weight quality matters more than squeezing every last MB out of the model file.

Ollama’s documentation says that parallel request handling multiplies the allocated context by OLLAMA_NUM_PARALLEL, so memory grows with parallel requests times context length. A second 32k prompt is exactly the kind of workload that blows the VRAM budget. The same documentation says q8_0 cuts K/V cache memory to about half of f16, usually with no noticeable quality hit, and that K/V cache quantization requires Flash Attention to be enabled. That makes it the cleanest way to buy the 2-3GB you need.

Dropping to Q3 weights would reduce precision on every token, which is a worse trade for structured reasoning, calculations, and normative lookups than compressing the cache. Qwen3.5-35B-A3B is already a 35B-parameter model with 3B activated per token and a 262k native context, so the bottleneck here is serving economics, not model capability. If q8_0 still leaves you short, I’d step down to 24k context before 16k, and I’d treat Q3 as the last resort.
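The memory argument above can be checked with a back-of-the-envelope estimate. The sketch below is a rough model, not Ollama's actual allocator: it assumes the K/V cache grows linearly with context length times parallel slots, and it uses the ggml block layouts for q8_0 and q4_0 (32 values plus one f16 scale per block). The `AttentionShape` class, the `kv_cache_bytes` helper, and the dimensions in the usage example are hypothetical placeholders, not Qwen3.5's published configuration. The environment variables in `OLLAMA_ENV` are the ones named in the analysis, plus `OLLAMA_FLASH_ATTENTION`.

```python
from dataclasses import dataclass

# Approximate bytes per cached element.
# f16 is 2 bytes; q8_0 packs 32 int8 values + one f16 scale (34 bytes / 32);
# q4_0 packs 32 4-bit values + one f16 scale (18 bytes / 32).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

# Settings discussed in the analysis; set them in the environment of the
# Ollama server process before it starts.
OLLAMA_ENV = {
    "OLLAMA_FLASH_ATTENTION": "1",   # needed for a quantized K/V cache
    "OLLAMA_KV_CACHE_TYPE": "q8_0",  # roughly half the cache size of f16
    "OLLAMA_NUM_PARALLEL": "2",      # two simultaneous chats
}


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical attention dimensions (an assumption, not the real model)."""
    kv_layers: int  # layers that keep a K/V cache
    kv_heads: int   # key/value heads per layer
    head_dim: int   # size of each head


def kv_cache_bytes(shape: AttentionShape, context_len: int,
                   num_parallel: int, cache_type: str) -> float:
    """Estimate K/V cache size: elements per token x tokens x bytes per element."""
    elements_per_token = 2 * shape.kv_layers * shape.kv_heads * shape.head_dim  # K and V
    total_tokens = context_len * num_parallel
    return elements_per_token * total_tokens * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Placeholder dimensions; replace with values from the model's config.
    shape = AttentionShape(kv_layers=24, kv_heads=4, head_dim=128)
    gib = 1024 ** 3
    for context_len in (32768, 24576, 16384):
        for cache_type in ("f16", "q8_0"):
            size = kv_cache_bytes(shape, context_len, 2, cache_type)
            print(f"ctx={context_len:>6} x2 sessions, {cache_type:>4}: {size / gib:.2f} GiB")
```

Swapping in the real layer and head counts from the model's configuration gives a quick sense of whether q8_0 alone recovers enough VRAM, or whether a 24k context is also needed. Actual usage will differ because of compute buffers and runtime overhead.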

// TAGS
qwen3-5, ollama, llm, inference, gpu, open-weights, self-hosted

DISCOVERED

68d ago

2026-03-20

PUBLISHED

68d ago

2026-03-20

RELEVANCE

8/10

AUTHOR

DjsantiX