OPEN_SOURCE // REDDIT · TUTORIAL

LM Studio Qwen 3.6 Crawls at 4 t/s

A LocalLLaMA user reports just 4 tokens/sec from Qwen3.6-27B-Q4_K_M on an RTX 5000 Ada laptop, with LM Studio listing both the Intel iGPU and the NVIDIA dGPU as available devices. The symptoms point to a setup that is not fully GPU-bound, most likely because the 27B dense model is too large for 16GB of VRAM and is spilling work onto the CPU and system RAM.

// ANALYSIS

The hot take: this looks less like a broken app and more like a capacity mismatch compounded by a backend-selection issue. On this class of hardware, a 27B dense GGUF at Q4_K_M is already at the edge of what fits, so once context growth and runtime overhead kick in, throughput falls off a cliff.
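
The capacity claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below combines the roughly 16.5GB Q4_K_M file size cited in the thread with illustrative architecture figures; the layer count, KV-head count, and head dimension are placeholders for a generic ~27B dense transformer, not published Qwen specs.

```python
# Back-of-envelope VRAM budget for a ~27B dense model at Q4_K_M.
# ASSUMPTIONS: n_layers, n_kv_heads, and head_dim are illustrative
# placeholders, not published Qwen specs; read the real values from
# the GGUF metadata if you have the file.

GiB = 1024**3

weights_bytes = 16.5 * GiB       # Q4_K_M file size cited in the thread

n_layers   = 60                  # assumed
n_kv_heads = 8                   # assumed (grouped-query attention)
head_dim   = 128                 # assumed
kv_dtype_bytes = 2               # fp16 KV cache

# Per token, every layer stores one K and one V vector per KV head.
kv_per_token = n_layers * 2 * n_kv_heads * head_dim * kv_dtype_bytes

for ctx in (4_096, 16_384, 32_768):
    kv = ctx * kv_per_token
    total = weights_bytes + kv
    print(f"ctx={ctx:6d}  kv={kv / GiB:5.2f} GiB  total={total / GiB:5.2f} GiB")
```

Even before any KV cache is allocated, the weights file alone is bigger than the card, so llama.cpp has to leave some layers in system RAM; that partial offload is exactly the regime where token rates collapse to single digits.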

  • The Reddit thread’s consensus is blunt: 27B dense models do not offload cleanly on 16GB VRAM, so partial system-memory execution is expected to be slow.
  • The model itself is roughly 16.5GB in Q4_K_M form, which alone exceeds a 16GB card before KV cache, context, and runtime buffers are even counted (see the arithmetic above).
  • LM Studio on Linux can use llama.cpp backends, but if it picks the wrong backend or prioritizes the iGPU, performance degrades sharply; the fix is to verify CUDA/Vulkan selection and confirm the dGPU is actually active (see the device check after this list).
  • The best practical workaround is to drop to a smaller quant or switch to a more GPU-efficient architecture like a 35B A3B/MoE variant, which commenters suggest will run much better on this machine; a sketch of the equivalent offload settings also follows the list.
  • The user's CPU hitting 170% while RAM and VRAM stay around 8GB is a strong sign that inference is not staying on the GPU as expected.
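
One way to confirm where the work is landing is to poll the NVIDIA card directly while a generation is in flight. A minimal sketch using the nvidia-ml-py (pynvml) bindings follows; note that NVML only enumerates NVIDIA devices, so the Intel iGPU will not appear here, and return types vary slightly across library versions.

```python
# Poll the NVIDIA card while a generation is running to see whether
# the dGPU is actually doing the work. Requires the nvidia-ml-py
# package (pip install nvidia-ml-py).
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetName, nvmlDeviceGetMemoryInfo, nvmlDeviceGetUtilizationRates,
)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        name = nvmlDeviceGetName(handle)   # bytes on older versions, str on newer
        mem = nvmlDeviceGetMemoryInfo(handle)
        util = nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: {name}")
        print(f"  VRAM: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB used")
        print(f"  Util: {util.gpu}% GPU, {util.memory}% memory")
finally:
    nvmlShutdown()
```

If VRAM sits near 8GB and utilization stays low while tokens trickle out, the model is split across devices and the CPU is the bottleneck, which matches the reported 170% CPU load.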
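LM Studio drives llama.cpp under the hood but hides some of its knobs. For readers who want direct control over offload, the same levers look roughly like this in a llama-cpp-python script; the GGUF filename and the numbers are hypothetical starting points for a 16GB card, not settings confirmed by the thread.

```python
# ASSUMPTIONS: the model filename is hypothetical, and the values are
# starting points for a 16GB card, not tested settings for this model.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q3_k_m.gguf",  # hypothetical smaller quant
    main_gpu=0,        # pin work to the dGPU, not the iGPU
    n_gpu_layers=-1,   # try full offload first; lower it if VRAM overflows
    n_ctx=8192,        # modest context keeps the KV cache small
)

out = llm("Say hello in five words.", max_tokens=32)
print(out["choices"][0]["text"])
```

The same idea applies inside LM Studio itself: pick the CUDA runtime explicitly, set the GPU offload layer count by hand, and cap the context length.
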
// TAGS
lm-studio · qwen · llm · gpu · inference · cuda · self-hosted

DISCOVERED: 4h ago (2026-04-27)

PUBLISHED: 6h ago (2026-04-27)

RELEVANCE: 7/10

AUTHOR: NorinBlade