Qwen3.5 Small hits 8GB VRAM wall
A Reddit user reports that Qwen3.5 9B on an 8GB card runs out of memory at 8k context with full GPU offload, and only reaches 32k after dropping `-ngl` to 12, which makes it too slow for real work. The thread is really about the tradeoff between model size, context length, and GPU headroom on consumer hardware.
This is the classic local-LLM squeeze: once the weights fit, the KV cache becomes the real memory bill.
- Qwen3.5-9B is officially a 9B-class model with 262,144 native context, so the limit here is the 8GB card, not the model's advertised window.
- llama.cpp maintainers note that `-c` directly changes KV buffer size, which is why a larger context setting can OOM even when the weights already fit.
- `-ngl 99` maximizes speed by keeping all layers on the GPU, but it leaves too little headroom for long-context inference on 8GB.
- Dropping `-ngl` buys memory for context, but the CPU offload penalty is exactly why the 32k setup feels unusably slow.
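To make the "KV cache becomes the real memory bill" point concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context in a standard full-attention transformer. The layer count, KV head count, and head dimension below are illustrative placeholders, not Qwen3.5-9B's actual configuration, and the function name `kv_cache_bytes` is hypothetical. Real usage also depends on cache quantization and on how many layers actually use full attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size for a standard full-attention transformer.

    Assumes every layer stores one key and one value vector per KV head
    per token, at bytes_per_elem bytes each (2 = fp16).
    """
    # The leading 2 accounts for keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# ASSUMPTION: placeholder architecture numbers chosen for illustration only;
# they are not taken from the Qwen3.5-9B model card.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (8192, 32768, 262144):
    gib = kv_cache_bytes(n_ctx=n_ctx, **cfg) / 2**30
    print(f"{n_ctx:>7} tokens -> {gib:.2f} GiB")
# Under these placeholder numbers:
#    8192 tokens -> 1.00 GiB
#   32768 tokens -> 4.00 GiB
#  262144 tokens -> 32.00 GiB
```

The cost is linear in context length, so under these assumptions going from 8k to 32k quadruples the cache. On an 8GB card that memory has to come from somewhere, which in practice means moving layers off the GPU.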
DISCOVERED
2026-03-29
PUBLISHED
2026-03-29
AUTHOR
No_Reference_7678