RTX 6000 x4 build weighs Qwen3.5 models
OPEN_SOURCE
REDDIT // 20d ago // INFRASTRUCTURE


An r/LocalLLaMA user with four RTX 6000 Max-Q cards and 768GB RAM is trying to pick the best local models for code auditing, fuzzing, and other security tooling with minimal quality loss. The thread centers on Qwen3.5-122B-A10B and Qwen3.5-397B-A17B, while commenters push a tiered setup instead of one giant model.

// ANALYSIS

Both candidates are MoE models, so active parameters matter more than headline size. The real decision is less "122B vs 397B" and more "which compromise gives you enough quality without making the serving stack too fragile?"
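That compromise can be put in rough numbers. A back-of-envelope sizing sketch, assuming ~96GB VRAM per RTX 6000 Max-Q and ~6.5 effective bits/weight for Q6_K (both my assumptions, not figures from the thread); real footprints add KV cache, activations, and runtime overhead:

```python
# Weight-memory estimate for the two MoE candidates at different precisions.
# Total params drive memory; active params drive per-token compute.
GIB = 1024**3

def weight_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a given quantization."""
    return total_params * bits_per_weight / 8 / GIB

VRAM_GIB = 4 * 96  # four assumed-96GB cards

for name, params in [("Qwen3.5-122B-A10B", 122e9), ("Qwen3.5-397B-A17B", 397e9)]:
    for fmt, bits in [("BF16", 16), ("Q6_K", 6.5)]:
        gib = weight_gib(params, bits)
        verdict = "fits" if gib < VRAM_GIB else "needs offload"
        print(f"{name} @ {fmt}: ~{gib:.0f} GiB ({verdict} in {VRAM_GIB} GiB VRAM)")
```

Under these assumptions the 122B model fits in BF16 with room for KV cache, while the 397B model only fits quantized, which is why the thread treats Q6_K as a fit strategy rather than a free choice.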

  • Qwen3.5-122B-A10B is 122B total / 10B active, so BF16 is the cleaner quality-first choice for everyday local use: https://huggingface.co/Qwen/Qwen3.5-122B-A10B
  • Qwen3.5-397B-A17B is 397B total / 17B active, which makes Q6_K a sensible fit strategy, but still a deliberate compromise rather than a no-brainer default: https://huggingface.co/Qwen/Qwen3.5-397B-A17B
  • Qwen’s own serving docs lean on current vLLM, SGLang, and KTransformers builds, and vLLM’s `--language-model-only` can free memory for more KV cache if you are not using vision. I’m inferring that a 4-GPU setup will want tighter context limits or more aggressive quantization than the docs’ 8-GPU examples show.
  • For fuzzing and code auditing, a smaller task model plus a CPU-side helper is likely to beat trying to force one giant model to do everything.
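A minimal sketch of that tiered setup; the tier names, size threshold, and task labels below are illustrative placeholders, not anything from the thread:

```python
# Route cheap, high-volume security chores to a small CPU-side model and
# escalate only deep-analysis work (or oversized inputs) to the big MoE.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    max_input_chars: int

SMALL = Tier("small-task-model (CPU-side)", 8_000)
LARGE = Tier("Qwen3.5-122B-A10B (4x GPU)", 120_000)

DEEP_TASKS = {"audit", "exploit-triage", "patch-review"}

def pick_tier(task: str, prompt: str) -> Tier:
    """Escalate deep-analysis tasks or inputs too large for the small tier."""
    if task in DEEP_TASKS or len(prompt) > SMALL.max_input_chars:
        return LARGE
    return SMALL  # fuzz-crash dedup, log summarization, etc.

print(pick_tier("fuzz-dedup", "short crash log").name)
print(pick_tier("audit", "full repo diff ...").name)
```

The design point is that fuzzing output is mostly repetitive triage, so the expensive model only sees the small fraction of cases that actually need deep reasoning.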
// TAGS
qwen-3.5 · llm · gpu · inference · open-weights · self-hosted · code-review · testing

DISCOVERED

20d ago

2026-03-22

PUBLISHED

20d ago

2026-03-22

RELEVANCE

8/10

AUTHOR

Direct_Bodybuilder63