OPEN_SOURCE
REDDIT · 14d ago · TUTORIAL
Qwen3.5 Small hits 8GB VRAM wall
A Reddit user reports that Qwen3.5 9B on an 8GB card OOMs at 8k context with full GPU offload, and only reaches 32k context after dropping `--ngl` to 12, at which point it is too slow for real work. The thread is really about the three-way tradeoff between model size, context length, and GPU headroom on consumer hardware.
// ANALYSIS
This is the classic local-LLM squeeze: once the weights fit, the KV cache becomes the real memory bill.
- Qwen3.5-9B is officially a 9B-class model with 262,144 native context, so the limit here is the 8GB card, not the model's advertised window.
- llama.cpp maintainers note that `-c` directly changes KV buffer size, which is why longer prompts can OOM even when the weights already fit.
- `--ngl 99` maximizes speed by keeping layers on GPU, but it leaves too little headroom for long-context inference on 8GB.
- Dropping `--ngl` buys memory for context, but the CPU offload penalty is exactly why the 32k setup feels unusably slow.
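The KV-cache arithmetic behind these bullets can be sketched. The layer and head counts below are illustrative placeholders for a 9B-class model with grouped-query attention, not Qwen3.5-9B's published config; substitute the real values from the model's GGUF metadata.

```python
# Rough KV-cache size estimate for a transformer at a given context length.
# K and V each store one head_dim vector per token, per layer, per KV head.
# Model dimensions here are ASSUMED for illustration, not official Qwen3.5-9B values.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size; the leading 2 covers K plus V, fp16 = 2 bytes/elem."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 9B-class config: 40 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (8_192, 32_768):
    gib = kv_cache_bytes(40, 8, 128, ctx) / 2**30
    print(f"ctx={ctx:>6}: ~{gib:.2f} GiB KV cache")  # ~1.25 GiB at 8k, ~5 GiB at 32k
```

Under these assumed dimensions, the cache grows linearly with `-c`: a cache that fits comfortably at 8k balloons to several GiB at 32k, which is why the only way to reach 32k on an 8GB card is to evict weight layers to the CPU via a lower `--ngl`.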
// TAGS
llm · inference · gpu · self-hosted · open-source · qwen3-5-small
DISCOVERED
2026-03-29
PUBLISHED
2026-03-29
RELEVANCE
8/10
AUTHOR
No_Reference_7678