OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE
Qwen3.6 Fits RTX 6000 Pro, 96GB
The Reddit thread asks whether an RTX 6000 Pro can comfortably serve Qwen3.6, and the consensus is yes: 96GB VRAM is plenty for the first open-weight Qwen3.6 model. The real constraint is less about raw fit and more about leaving room for long context, KV cache, and concurrency.
// ANALYSIS
The hot take: 96GB is not the real question here; throughput and serving setup are. For a single-user or small-team local stack, the RTX 6000 Pro looks like a very safe target for Qwen3.6-35B-A3B, but the moment you push longer contexts or multiple sessions, memory headroom gets eaten fast.
- Qwen3.6-35B-A3B is a sparse MoE model with 35B total and 3B active parameters, so it is much easier to host than a dense model of similar headline size.
- Official Qwen docs say the model is supported in vLLM, SGLang, llama.cpp, and Transformers, which makes local deployment realistic rather than experimental.
- Reddit replies suggest 96GB can handle BF16, FP8, or Q8 comfortably, with one commenter saying they already run it in full BF16 on 96GB.
- The thread also hints at the real tradeoff: if you want big context windows and tool use, KV cache and concurrency matter as much as model weights.
- –For a local coding assistant or RAG server, this is probably overkill in a good way; for a multi-user inference box, you still need to benchmark before assuming “fits” means “fast.”
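The "fits vs. fast" tradeoff above comes down to simple arithmetic: weight memory scales with *total* parameters (an MoE loads all experts into VRAM even though only 3B are active per token), while KV cache scales with layers, context length, and concurrent sequences. A rough sketch, where the layer/head/dim numbers are purely hypothetical since the exact Qwen3.6-35B-A3B architecture isn't given in the thread:

```python
# Back-of-envelope VRAM estimate for a 35B-total-parameter MoE model.
# Architecture numbers (layers, KV heads, head_dim) are illustrative
# assumptions, NOT published Qwen3.6-35B-A3B specs.

def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weights: total params count, not active params -- all MoE experts
    must sit in VRAM even though only a few route per token."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch * bytes_per_elem) / 1e9

# Weight footprint at common precisions mentioned in the thread.
for name, bpp in [("BF16", 2.0), ("FP8", 1.0), ("Q8", 1.0)]:
    print(f"{name}: ~{weight_memory_gb(35, bpp):.0f} GB weights")

# Hypothetical config: 48 layers, 8 KV heads (GQA), head_dim 128,
# one user at 128k context, BF16 cache.
print(f"KV cache: ~{kv_cache_gb(48, 8, 128, 131072, 1):.1f} GB")
```

Under these assumptions, BF16 weights alone are ~70 GB and a single 128k-context session adds roughly another 26 GB of KV cache, which is exactly why "fits on 96GB" stops being comfortable once long contexts or multiple sessions enter the picture, and why FP8/Q8 weights buy so much concurrency headroom.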
// TAGS
qwen3.6-35b-a3b · llm · inference · gpu · self-hosted · vram · moe
DISCOVERED
4h ago
2026-04-19
PUBLISHED
8h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
Emergency_Brief_9141