Qwen3.5 Small hits 8GB VRAM wall

// 106d agoTUTORIAL

Qwen3.5 Small hits 8GB VRAM wall

A Reddit user says Qwen3.5 9B on 8GB VRAM OOMs at 8k context with full GPU offload, then only runs at 32k after dropping --ngl to 12, which makes it too slow for work. The thread is really about the tradeoff between model size, context length, and GPU headroom on consumer hardware.

// ANALYSIS

This is the classic local-LLM squeeze: once the weights fit, the KV cache becomes the real memory bill.

–Qwen3.5-9B is officially a 9B-class model with 262,144 native context, so the limit here is the 8GB card, not the model's advertised window.
–llama.cpp maintainers note that `-c` directly changes KV buffer size, which is why longer prompts can OOM even when the weights already fit.
–`--ngl 99` maximizes speed by keeping layers on GPU, but it leaves too little headroom for long-context inference on 8GB.
–Dropping `--ngl` buys memory for context, but the CPU offload penalty is exactly why the 32k setup feels unusably slow.

// TAGS

llminferencegpuself-hostedopen-sourceqwen3-5-small

DISCOVERED

106d ago

2026-03-29

PUBLISHED

106d ago

2026-03-29

RELEVANCE

8/ 10

AUTHOR

No_Reference_7678

// KEEP READING

More AI developer news from the feed

EXPLORE FULL FEED

LAUNCH12m ago

OpenAI launches ChatGPT Sites for web apps

ChatGPT Sites is a new feature by OpenAI designed to make internal communication more engaging and functional by letting users create custom web apps, trackers, and calculators via simple chat prompts. It eliminates traditional frontend coding and deployment steps, allowing teams to quickly generate and deploy interactive, shareable web pages.

RESEARCH46m ago

A new study demonstrates that visual pretraining on raw visual documents bypasses OCR/parsing limitations and consistently outperforms traditional text-only pretraining for building language intelligence.

Web search paper title}

TUTORIAL1h ago

Tutorial runs MiniMax M3 inside Claude Code

A recent YouTube video explores how developers can integrate the MiniMax M3 model into Claude Code. MiniMax M3 is an open-weight mixture-of-experts (MoE) model that boasts a massive 1-million-token context window and strong performance on coding benchmarks, making it a viable alternative to Claude's native models for users hitting usage constraints.