OPEN_SOURCE ↗
REDDIT // INFRASTRUCTURE
RTX 6000 x4 build weighs Qwen3.5 models
A r/LocalLLaMA user with four RTX 6000 Max-Q cards and 768GB RAM is trying to pick the best local models for code auditing, fuzzing, and other security tooling with minimal quality loss. The thread centers on Qwen3.5-122B-A10B and Qwen3.5-397B-A17B, while commenters push a tiered setup instead of one giant model.
// ANALYSIS
Both candidates are MoE models, so active parameters matter more than headline size. The real decision is less "122B vs 397B" and more "which compromise gives you enough quality without making the serving stack too fragile?"
- Qwen3.5-122B-A10B is 122B total / 10B active; at BF16 its weights run roughly 244 GB, which makes it the cleaner quality-first choice for everyday local use: https://huggingface.co/Qwen/Qwen3.5-122B-A10B
- Qwen3.5-397B-A17B is 397B total / 17B active, which makes Q6_K a sensible fit strategy, but still a deliberate compromise rather than a no-brainer default (see the sizing sketch after this list): https://huggingface.co/Qwen/Qwen3.5-397B-A17B
- Qwen’s own serving docs lean on current vLLM, SGLang, and KTransformers builds, and vLLM’s `--language-model-only` can free memory for more KV cache if you are not using vision. I’m inferring that a 4-GPU setup will want tighter context limits or more aggressive quantization than the docs’ 8-GPU examples show; the KV-cache sketch below shows why.
- For fuzzing and code auditing, a smaller task model plus a CPU-side helper is likely to beat forcing one giant model to do everything (a minimal routing sketch follows).
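To make the fit question concrete, here is a back-of-envelope weight-memory check. Assumptions not stated in the thread: 96 GB per card (the RTX PRO 6000 Blackwell Max-Q figure) and ~6.56 bits/weight for Q6_K (llama.cpp's nominal rate). Real servers also need room for KV cache, activations, and runtime overhead, so treat these numbers as a floor, not a guarantee.

```python
# Rough weight-memory check for the two candidates on a 4-GPU box.
# 96 GB/card and 6.56 bpw for Q6_K are assumptions, not thread facts;
# KV cache, activations, and framework overhead are ignored here.

GPUS, GB_PER_GPU = 4, 96
BUDGET_GB = GPUS * GB_PER_GPU  # 384 GB total VRAM

def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

candidates = {
    "Qwen3.5-122B-A10B @ BF16": weight_gb(122, 16),    # ~244 GB
    "Qwen3.5-397B-A17B @ Q6_K": weight_gb(397, 6.56),  # ~326 GB
}
for name, gb in candidates.items():
    print(f"{name}: ~{gb:.0f} GB weights, ~{BUDGET_GB - gb:.0f} GB headroom")
```

The takeaway matches the thread: BF16 122B leaves real headroom for KV cache, while Q6_K 397B fits but crowds out long contexts.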
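On why context limits bite: per-sequence KV cache grows linearly with context length. The layer and head counts below are hypothetical placeholders, not the published Qwen3.5 config; swap in the real values from the model's config.json before trusting the output.

```python
# KV cache per sequence: K and V tensors for every layer.
# Layer/head/dim values are HYPOTHETICAL placeholders, not Qwen3.5's
# actual config; replace them with numbers from config.json.

def kv_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
          bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache in GB (BF16 cache by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# e.g. 60 layers, 8 GQA KV heads, head_dim 128:
for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_gb(ctx, 60, 8, 128):.1f} GB per sequence")
```

Under those placeholder numbers a single 128K-token sequence already costs ~32 GB, which is exactly the headroom a Q6_K 397B deployment would be short on.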
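Finally, a minimal sketch of the tiered setup commenters push, assuming two OpenAI-compatible local servers (the ports and model names here are made up for illustration): send cheap, mechanical passes to a small model and escalate only flagged files to the big one.

```python
# Tiered routing sketch: triage with a small model, escalate deep audits.
# Endpoints and model names are hypothetical; both assume local
# OpenAI-compatible servers (vLLM, SGLang, etc.).
import requests

TIERS = {
    "triage": ("http://localhost:8001/v1/chat/completions", "small-coder"),
    "audit":  ("http://localhost:8002/v1/chat/completions", "qwen3.5-122b-a10b"),
}

def ask(tier: str, prompt: str) -> str:
    url, model = TIERS[tier]
    r = requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Cheap pass first; only suspicious diffs go to the expensive tier.
verdict = ask("triage", "Flag suspicious functions in this diff: ...")
```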
// TAGS
qwen-3.5 · llm · gpu · inference · open-weights · self-hosted · code-review · testing
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
Direct_Bodybuilder63