Qwen3.5 27B chokes on 8GB VRAM
OPEN_SOURCE
REDDIT · 12d ago · INFRASTRUCTURE

A Reddit user reports Qwen3.5-27B-Q4_K_M failing in Ollama on an RTX 4060 laptop GPU with 8GB VRAM and 32GB system RAM after just two short messages ("Hi" plus one follow-up). They note that Gemma 3 27B still runs on the same machine, albeit slowly, which points to a memory and runtime mismatch rather than a simple prompt issue.

// ANALYSIS

This is the classic local-LLM trap: quantized does not mean lightweight enough to ignore memory math, especially once cache and offload are in play.

  • Ollama lists this build at 27.8B parameters and 17GB quantized, so an 8GB mobile GPU is already behind before KV cache or prompt growth. [Ollama](https://ollama.com/library/qwen3.5:27b-q4_K_M)
  • Qwen's own card says the model's default context length is 262,144 tokens and recommends SGLang, vLLM, or KTransformers on multi-GPU setups; it also advises shrinking context when you hit OOM. [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)
  • Community threads report similar Ollama 500s, crashes, and brutal slowdown on 27B/35B Qwen3.5 builds, with some users only stabilizing things by lowering context or switching runtimes. [r/ollama](https://www.reddit.com/r/ollama/comments/1rgypnv/has_anyone_got_qwen35_to_work_with_ollama/) [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1rl69p9/running_qwen_35_27b_and_its_super_slow/)
  • Gemma 3 27B QAT models are advertised as using about 3x less memory than non-quantized versions, which helps explain why that family can feel easier to run on the same hardware. [Gemma 3](https://ollama.com/library/gemma3:4b-it-q4_K_M)
  • If local access matters more than raw model size, 4B-9B class models are the practical sweet spot on 8GB VRAM; 27B-class models are better saved for desktop GPUs or hosted inference.
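Both Qwen's card and the community threads converge on shrinking the context window. In Ollama that is the `num_ctx` parameter, set via a Modelfile; a minimal sketch, assuming the library tag from the link above:

```
FROM qwen3.5:27b-q4_K_M
PARAMETER num_ctx 4096
```

Then `ollama create qwen35-4k -f Modelfile` and `ollama run qwen35-4k` will serve the same weights with a 4k context, trading long-context ability for a KV cache that actually fits next to the weights.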
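The memory math behind the first two bullets can be sketched with a back-of-envelope KV-cache estimate. The layer, head, and head-dimension numbers below are illustrative assumptions for a 27B-class model, not Qwen's published architecture; the 17GB weights figure is the Q4_K_M file size from the Ollama listing.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Rough fp16 KV-cache size: K and V tensors, per layer, per KV head."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Hypothetical 27B-class config (assumed values, for illustration only).
weights_gib = 17.0  # Q4_K_M file size from the Ollama library page
cache = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, context_len=262_144)
print(f"KV cache at the full 262,144-token context: {cache:.1f} GiB")
print(f"Weights + cache: {weights_gib + cache:.1f} GiB vs 8 GiB of VRAM")
```

Even with grouped-query attention keeping the per-token cache small, the default 262k context multiplies it into tens of gigabytes, so the runtime either spills to system RAM (the slowdown) or fails outright (the 500s).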
// TAGS
qwen3.5-27b · ollama · llm · inference · gpu · self-hosted

DISCOVERED

2026-03-30

PUBLISHED

2026-03-30

RELEVANCE

8/10

AUTHOR

An0n_A55a551n