Qwen3.5 benchmarks show production-ready MoE on H20 GPUs
New performance tests of Alibaba's Qwen3.5-397B-A17B Mixture-of-Experts (MoE) model on an 8x NVIDIA H20 cluster demonstrate efficient, high-throughput serving for 400B-class models. By pairing SGLang's optimized MoE kernels with the H20's 141GB of VRAM per card, the setup provides the memory capacity required for large batch sizes and extended context windows, making it a viable option for production environments constrained by hardware availability.
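As a rough sanity check on the capacity claim, the sketch below estimates whether 397B parameters fit in the cluster's VRAM at common weight precisions. The parameter count is read off the model name and the 141GB-per-card figure comes from the article; everything else (byte widths, ignoring activations and KV cache) is a simplifying assumption.

```python
# Back-of-envelope: do 397B parameters fit in 8x H20 (141 GB per card)?
# Assumptions: parameter count taken from the model name; weight memory
# only (activations, KV cache, and runtime overhead are ignored here).
GIB = 1024**3

total_params = 397e9                   # Qwen3.5-397B-A17B total parameters
cluster_bytes = 8 * 141 * 1000**3      # 8 cards x 141 GB (decimal GB)

for precision, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    weight_gib = total_params * bytes_per_param / GIB
    share = total_params * bytes_per_param / cluster_bytes
    print(f"{precision}: ~{weight_gib:,.0f} GiB of weights "
          f"({share:.0%} of ~{cluster_bytes / GIB:,.0f} GiB cluster VRAM)")
```

At BF16 the weights alone come to roughly 740 GiB, about 70% of the ~1,050 GiB pool, consistent with the claim that the model stays resident without aggressive quantization.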
The NVIDIA H20 emerges as an ideal choice for massive MoE inference, trading raw compute for the memory headroom needed to keep 400B-class models in memory without aggressive quantization. SGLang's RadixAttention and specialized kernels significantly reduce the routing overhead inherent in sparse models like Qwen3.5. With 1.1TB of total VRAM, the 8x H20 setup provides the capacity to serve long-context requests of up to 262k tokens natively. This benchmark demonstrates that for large MoE models, memory capacity and bandwidth matter more than peak TFLOPS for cost-effective production serving.
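To see where that headroom goes at long context, here is a rough KV-cache sizing for a single 262k-token request. The layer count, KV-head count, and head dimension below are illustrative placeholders (borrowed from the Qwen3 generation), not published Qwen3.5-397B-A17B specs; swap in the real config.json values for an exact figure.

```python
# Rough KV-cache footprint of one full-length (262k-token) request.
# K and V each store seq_len * num_kv_heads * head_dim values per layer,
# hence the leading factor of 2.
GIB = 1024**3

seq_len       = 262_144  # 262k-token context from the benchmark claim
num_layers    = 94       # placeholder (Qwen3-235B value; Qwen3.5 may differ)
num_kv_heads  = 4        # placeholder GQA key/value head count
head_dim      = 128      # placeholder head dimension
bytes_per_val = 2        # BF16 cache; halve for an FP8 KV cache

kv_bytes = 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_val
print(f"KV cache, one 262k request: ~{kv_bytes / GIB:.0f} GiB")
```

Under these placeholder dimensions a single full-context request consumes on the order of 47 GiB of cache, so the VRAM left over after BF16 weights bounds how many such requests can be batched concurrently, which is exactly where the H20's capacity advantage shows up.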
DISCOVERED: 2026-03-21
PUBLISHED: 2026-03-21
AUTHOR: MathematicianNo2877