Qwen3.5 hits VRAM wall, parallelism stalls
An engineer running Qwen3.5-35B-A3B in Open WebUI + Ollama on a 32GB RTX 5090 wants enough headroom for two simultaneous chats without tanking technical accuracy. The decision is whether to save memory with KV-cache quantization, cheaper weights, or a smaller context window.
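To size the problem, here is a back-of-envelope KV-cache estimate. The attention config below (48 layers, 4 KV heads, head dim 128) is a hypothetical placeholder, not Qwen3.5's published architecture; check `ollama ps` / the model card for the real numbers before trusting the totals.

```python
# Rough KV-cache sizing: 2x covers the K and V tensors per layer.
# Layer/head/dim values are assumptions for illustration only.
def kv_cache_gib(ctx, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

f16 = kv_cache_gib(32768)                   # one 32k session, f16 cache
q8 = kv_cache_gib(32768, bytes_per_elem=1)  # q8_0 is ~half of f16
print(f"per 32k session: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
```

Under these assumed dimensions, two parallel 32k sessions cost ~6 GiB of cache at f16 but ~3 GiB at q8_0, which is exactly the 2-3 GB of headroom discussed below.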
I’d start with Flash Attention plus OLLAMA_KV_CACHE_TYPE=q8_0, keep the Q4 weights, and only trim context if two sessions still do not fit. For a V&V/RAMS assistant, preserving base-weight quality matters more than squeezing every last MB out of the model file.

Ollama's docs note that OLLAMA_NUM_PARALLEL scales memory with parallel requests times context length, so a second 32k prompt is exactly the kind of workload that blows the VRAM budget. They also note that q8_0 cuts K/V cache memory to roughly half of f16, usually with no noticeable quality hit, which is the cleanest way to buy the 2-3 GB you need. Dropping to Q3 would instead reduce precision on every token, a worse trade for structured reasoning, calculations, and normative lookups than compressing the cache.

Qwen3.5 is already a 35B model with 3B activated parameters and 262k native context, so the bottleneck here is serving economics, not model capability. If q8_0 still leaves you short, I'd step down to 24k context before 16k, and treat Q3 as the last resort.
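The settings above map onto Ollama environment variables. A minimal sketch, assuming a recent Ollama build (the context-length value shown is the 32k starting point, not a recommendation to stop there):

```shell
# Halve the KV cache and allow two concurrent sessions.
# q8_0 cache quantization requires flash attention to be enabled.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0    # ~half the cache memory of f16
export OLLAMA_NUM_PARALLEL=2        # two simultaneous chats
export OLLAMA_CONTEXT_LENGTH=32768  # drop to 24576 if VRAM is still short
ollama serve
```

On a systemd install, the same variables go in `Environment=` lines of the `ollama.service` unit instead of a shell profile.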
Discovered: 2026-03-20
Published: 2026-03-20
Author: DjsantiX