OPEN_SOURCE ↗
REDDIT · 4h ago · INFRASTRUCTURE
Qwen3.6 strains single-GPU llama.cpp setups
A LocalLLaMA user is trying to serve Qwen3.6-35B-A3B on an RTX 3090 with llama-server, 128K context, q8 KV cache, flash attention, and automatic `--fit`, but the setup fills nearly all 24GB of VRAM and stutters. The thread is a practical tuning case for local inference, not a product launch.
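The reported setup can be sketched as a llama-server command line. This is a reconstruction from the thread's description, not a verified invocation: the model filename is a placeholder, and the exact spellings of the `--fit`/`--fit-target` options are assumed from the flag names mentioned in the post.

```shell
# Hedged reconstruction of the thread's setup; model path and
# --fit/--fit-target spellings are assumptions, not confirmed flags.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --ctx-size 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  --fit --fit-target 3072
```

On a 24GB card this combination leaves llama-server to squeeze weights, KV cache, and compute buffers into whatever the desktop hasn't already claimed.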
// ANALYSIS
The interesting bit here is not that Qwen3.6 fails on a 3090, but that modern MoE models make llama.cpp memory budgeting feel like systems work again.
- The main GGUF is already around the 20-21GB range at Q4_K, so a 24GB card leaves very little room for KV cache, vision projector, CUDA buffers, desktop display use, and batching overhead.
- `fit-ctx = 131072` plus `cache-type-k/v = q8_0` is the biggest pressure point; dropping context, using lower KV quantization, or leaving more VRAM headroom will likely matter more than sampler tweaks.
- `fit-target = 3072` may be too tight on a display-attached 3090; local inference usually performs better with several GB free than with every MiB allocated.
- The BF16 `mmproj` is extra memory cost even when the workload is mostly text; only load it for vision routes, or use a smaller quantized projector if available.
- Batch and ubatch affect prompt processing speed and memory spikes, but generation stutter usually appears when the model crosses the VRAM cliff.
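The KV-cache pressure point above is easy to quantify. A minimal sketch, using hypothetical GQA dimensions for illustration (48 layers, 4 KV heads, head dim 128 -- not the real Qwen3.6 config) and the q8_0 storage cost of 34 bytes per 32 values:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer per token,
    # hence the factor of 2.
    return int(ctx * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem)

# q8_0 packs 32 int8 values plus a 2-byte fp16 scale: 34 bytes per 32 elems.
Q8_0 = 34 / 32

# Hypothetical model dimensions, chosen only to illustrate the arithmetic.
kv = kv_cache_bytes(ctx=131072, n_layers=48, n_kv_heads=4,
                    head_dim=128, bytes_per_elem=Q8_0)
print(f"{kv / 2**30:.1f} GiB")  # ~6.4 GiB at the full 128K context
```

Even with these made-up dimensions, a q8_0 KV cache at 128K context lands in the multi-GiB range, which on top of ~20GB of weights explains why the 24GB card is over the cliff; halving the context roughly halves this term.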
// TAGS
llama-cpp · qwen3-6-35b-a3b · inference · gpu · self-hosted · open-weights
DISCOVERED
4h ago
2026-04-22
PUBLISHED
7h ago
2026-04-21
RELEVANCE
7/10
AUTHOR
valmist