OPEN_SOURCE
REDDIT · 12d ago · INFRASTRUCTURE
Qwen3.5 27B chokes on 8GB VRAM
A Reddit user reports Qwen3.5-27B-Q4_K_M failing in Ollama on an RTX 4060 laptop (8GB VRAM, 32GB RAM) after just two short messages, the first being "Hi". They note Gemma 3 27B still runs on the same machine, albeit slowly, which points to a memory and runtime mismatch rather than a simple prompt issue.
// ANALYSIS
This is the classic local-LLM trap: quantized does not mean lightweight enough to ignore memory math, especially once cache and offload are in play.
- Ollama lists this build at 27.8B parameters and 17GB quantized, so an 8GB mobile GPU is already behind before KV cache or prompt growth. [Ollama](https://ollama.com/library/qwen3.5:27b-q4_K_M)
- Qwen's own card says the model's default context length is 262,144 tokens and recommends SGLang, vLLM, or KTransformers on multi-GPU setups; it also advises shrinking context when you hit OOM. [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)
- Community threads report similar Ollama 500s, crashes, and brutal slowdowns on 27B/35B Qwen3.5 builds, with some users only stabilizing things by lowering context or switching runtimes. [r/ollama](https://www.reddit.com/r/ollama/comments/1rgypnv/has_anyone_got_qwen35_to_work_with_ollama/) [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1rl69p9/running_qwen_35_27b_and_its_super_slow/)
- Gemma 3 27B QAT models are advertised as using about 3x less memory than non-quantized versions, which helps explain why that family can feel easier to run on the same hardware. [Gemma 3](https://ollama.com/library/gemma3:4b-it-q4_K_M)
- –If local access matters more than raw model size, 4B-9B class models are the practical sweet spot on 8GB VRAM; 27B-class models are better saved for desktop GPUs or hosted inference.
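The memory math above can be sketched in a few lines. The weight footprint comes straight from the listed figures (27.8B parameters at ~17GB quantized, roughly 0.61 bytes per parameter); the KV-cache shape below is an illustrative assumption for a 27B-class GQA model, not the published Qwen3.5-27B config:

```python
# Back-of-envelope VRAM math for a quantized LLM, showing why a ~17 GB
# Q4_K_M build cannot fit an 8 GB GPU even before the KV cache grows.

def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Quantized weight footprint in GiB."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """fp16 KV cache: two tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

# Ollama lists the build at 27.8B params / ~17 GB, i.e. ~0.61 bytes/param.
weights = weight_gib(27.8, 0.61)

# Hypothetical GQA shape for a 27B-class model (assumed, not official):
cache_8k = kv_cache_gib(layers=60, kv_heads=8, head_dim=128, context=8192)

print(f"weights:          {weights:.1f} GiB")   # roughly double the 8 GiB card
print(f"KV cache @ 8k ctx: {cache_8k:.2f} GiB")  # on top of the weights
```

Even generous partial offload to system RAM leaves most of the weights crossing the PCIe bus every token, which is consistent with the crashes and brutal slowdowns reported above.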
// TAGS
qwen3.5-27b · ollama · llm · inference · gpu · self-hosted
DISCOVERED
2026-03-30
PUBLISHED
2026-03-30
RELEVANCE
8/10
AUTHOR
An0n_A55a551n