Qwen3.5 hits VRAM wall, parallelism stalls
An engineer running Qwen3.5-35B-A3B in Open WebUI + Ollama on a 32GB RTX 5090 wants enough headroom for two simultaneous chats without tanking technical accuracy. The decision is whether to save memory with KV-cache quantization, cheaper weights, or a smaller context window.
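To size the problem, here is a back-of-envelope KV-cache estimate. The attention config below (48 layers, 4 KV heads, head dim 128) is a hypothetical placeholder, not Qwen3.5's published architecture; check `ollama ps` / the model card for the real numbers before trusting the totals.

```python
# Rough KV-cache sizing: 2x covers the K and V tensors per layer.
# Layer/head/dim values are assumptions for illustration only.
def kv_cache_gib(ctx, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

f16 = kv_cache_gib(32768)                   # one 32k session, f16 cache
q8 = kv_cache_gib(32768, bytes_per_elem=1)  # q8_0 is ~half of f16
print(f"per 32k session: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
```

Under these assumed dimensions, two parallel 32k sessions cost ~6 GiB of cache at f16 but ~3 GiB at q8_0, which is exactly the 2-3 GB of headroom discussed below.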
I’d start with Flash Attention plus OLLAMA_KV_CACHE_TYPE=q8_0, keep the Q4 weights, and only trim context if two sessions still do not fit. For a V&V/RAMS assistant, preserving base-weight quality matters more than squeezing every last MB out of the model file.

Ollama's docs note that OLLAMA_NUM_PARALLEL scales memory with parallel requests times context length, so a second 32k prompt is exactly the kind of workload that blows the VRAM budget. They also note that q8_0 cuts K/V cache memory to roughly half of f16, usually with no noticeable quality hit, which is the cleanest way to buy the 2-3 GB you need. Dropping to Q3 would instead reduce precision on every token, a worse trade for structured reasoning, calculations, and normative lookups than compressing the cache.

Qwen3.5 is already a 35B model with 3B activated parameters and 262k native context, so the bottleneck here is serving economics, not model capability. If q8_0 still leaves you short, I'd step down to 24k context before 16k, and treat Q3 as the last resort.
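The settings above map onto Ollama environment variables. A minimal sketch, assuming a recent Ollama build (the context-length value shown is the 32k starting point, not a recommendation to stop there):

```shell
# Halve the KV cache and allow two concurrent sessions.
# q8_0 cache quantization requires flash attention to be enabled.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0    # ~half the cache memory of f16
export OLLAMA_NUM_PARALLEL=2        # two simultaneous chats
export OLLAMA_CONTEXT_LENGTH=32768  # drop to 24576 if VRAM is still short
ollama serve
```

On a systemd install, the same variables go in `Environment=` lines of the `ollama.service` unit instead of a shell profile.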
Discovered: 2026-03-20
Published: 2026-03-20
Author: DjsantiX