Qwen3.5 hits VRAM wall, parallelism stalls
OPEN_SOURCE · REDDIT · 22d ago · INFRASTRUCTURE


An engineer running Qwen3.5-35B-A3B in Open WebUI + Ollama on a 32GB RTX 5090 wants enough headroom for two simultaneous chats without tanking technical accuracy. The decision is whether to save memory with KV-cache quantization, cheaper weights, or a smaller context window.
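The knobs being weighed map directly to Ollama server settings. A minimal sketch of the environment in question; the env var names are real Ollama options, but the values are illustrative, not a verified working config for this GPU:

```shell
# Ollama server environment under discussion (illustrative values)
export OLLAMA_FLASH_ATTENTION=1       # flash attention; required for KV-cache quantization
export OLLAMA_KV_CACHE_TYPE=q8_0     # KV-cache precision: f16 (default) | q8_0 | q4_0
export OLLAMA_NUM_PARALLEL=2         # two simultaneous chats
export OLLAMA_CONTEXT_LENGTH=32768   # per-request context window
ollama serve
```

Each parallel slot gets its own KV cache, which is why the context window and the parallelism setting multiply against each other in the VRAM budget.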

// ANALYSIS

I’d start with Flash Attention plus OLLAMA_KV_CACHE_TYPE=q8_0, keep the Q4 weights, and only trim context if two sessions still do not fit. For a V&V/RAMS assistant, preserving base-weight quality matters more than squeezing every last MB out of the model file.

Ollama’s documentation says OLLAMA_NUM_PARALLEL scales memory with parallel requests times context length, so a second 32k prompt is exactly the kind of workload that blows the VRAM budget. It also says q8_0 cuts K/V cache memory to roughly half of f16, usually with no noticeable quality hit, which is the cleanest way to buy the 2-3GB of headroom needed here. Dropping to Q3 weights would reduce precision on every token, a worse trade for structured reasoning, calculations, and normative lookups than compressing the cache.

Qwen3.5-35B-A3B is already a 35B model with 3B activated parameters and 262k native context, so the bottleneck is serving economics, not model capability. If q8_0 still leaves you short, step down to 24k context before 16k, and treat Q3 as the last resort.
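The parallel-requests-times-context scaling can be made concrete with back-of-envelope arithmetic. A minimal sketch: the layer count, KV-head count, and head dimension below are illustrative placeholders, not published Qwen3.5-35B-A3B specs, but the f16-to-q8_0 halving holds regardless of the exact architecture:

```python
# Rough KV-cache sizing. NOTE: N_LAYERS, N_KV_HEADS, and HEAD_DIM are
# ILLUSTRATIVE assumptions, not the model's real architecture.
N_LAYERS = 48      # assumed transformer layer count
N_KV_HEADS = 8     # assumed GQA key/value heads
HEAD_DIM = 128     # assumed per-head dimension
BYTES_F16 = 2
BYTES_Q8 = 1       # q8_0 stores ~1 byte/element (ignoring small scale overhead)

def kv_cache_bytes(ctx_tokens: int, parallel: int, bytes_per_elem: int) -> int:
    # K and V each hold n_layers * n_kv_heads * head_dim values per token,
    # and every parallel slot keeps its own full-context cache.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * ctx_tokens * parallel

GIB = 1024 ** 3
f16 = kv_cache_bytes(32_768, 2, BYTES_F16) / GIB
q8 = kv_cache_bytes(32_768, 2, BYTES_Q8) / GIB
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB")  # f16: 12.0 GiB, q8_0: 6.0 GiB
```

Under these assumed dimensions, two 32k sessions at f16 already consume a double-digit share of a 32GB card before weights and activations, while q8_0 returns exactly half of that.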

// TAGS
qwen3-5 · ollama · llm · inference · gpu · open-weights · self-hosted

DISCOVERED: 22d ago (2026-03-20)

PUBLISHED: 23d ago (2026-03-20)

RELEVANCE: 8/10

AUTHOR: DjsantiX