Qwen3.6 strains single-GPU llama.cpp setups
OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE

A LocalLLaMA user is trying to serve Qwen3.6-35B-A3B on an RTX 3090 with llama-server, using 128K context, a q8 KV cache, flash attention, and automatic `--fit`, but the setup fills nearly all 24 GB of VRAM and stutters during generation. The thread is a practical tuning case for local inference, not a product launch.
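A lower-pressure launch along the lines discussed in the thread might look like the sketch below. Flag names match recent llama.cpp builds, but the model filename and the exact context/batch numbers are illustrative assumptions, not values from the post; check `llama-server --help` on your build.

```shell
# Sketch: a gentler llama-server launch for a display-attached 24 GB card.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K.gguf \
  -ngl 99 \
  -c 32768 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -b 1024 \
  -ub 256
# Dropping -c from 131072 to 32768 cuts KV-cache memory roughly 4x, and
# omitting --mmproj keeps the vision projector out of VRAM for text-only use.
```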

// ANALYSIS

The interesting bit here is not that Qwen3.6 fails on a 3090, but that modern MoE models make llama.cpp memory budgeting feel like systems work again.

  • The main GGUF alone is around 20-21 GB at Q4_K, so a 24 GB card leaves very little room for the KV cache, vision projector, CUDA buffers, desktop display, and batching overhead.
  • `fit-ctx = 131072` plus `cache-type-k/v = q8_0` is the biggest pressure point; dropping context, using lower KV quantization, or leaving more VRAM headroom will likely matter more than sampler tweaks.
  • `fit-target = 3072` may be too tight on a display-attached 3090; local inference usually performs better with several GB free than with every MiB allocated.
  • The BF16 `mmproj` is extra memory cost even when the workload is mostly text; only load it for vision routes, or use a smaller quantized projector if available.
  • Batch and ubatch sizes affect prompt-processing speed and transient memory spikes, but generation stutter usually appears once allocations spill past dedicated VRAM into system memory.
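The KV-cache pressure in the second bullet is easy to budget by hand. A minimal sketch, using hypothetical architecture numbers (48 layers, 4 GQA KV heads, head dim 128) since the thread does not give the real Qwen3.6-35B-A3B values:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: float) -> float:
    """VRAM needed for the KV cache: K and V each store
    n_kv_heads * head_dim values per layer per cached token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30

# q8_0 packs 32 values into 34 bytes (32 int8 quants + one fp16 scale),
# i.e. 1.0625 bytes per element; fp16 would be 2.0 bytes per element.
print(kv_cache_gib(48, 4, 128, 131072, 1.0625))  # 128K context: 6.375 GiB
print(kv_cache_gib(48, 4, 128, 32768, 1.0625))   # 32K context: ~1.6 GiB
```

Under these assumed shapes, 128K of q8 KV cache alone eats over a quarter of the card, which is why shrinking context tends to beat any sampler-level tweak.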
// TAGS
llama-cpp · qwen3-6-35b-a3b · inference · gpu · self-hosted · open-weights

DISCOVERED

2026-04-22 (4h ago)

PUBLISHED

2026-04-21 (7h ago)

RELEVANCE

7 / 10

AUTHOR

valmist