Qwen3.6-35B-A3B hits 400 tok/s on H100
An SGLang setup for Qwen3.6-35B-A3B reportedly exceeds 400 tokens per second for code generation on a single NVIDIA H100 by combining DFlash parallel speculative decoding with FP8 precision. At that speed, real-time agentic workflows become practical on one GPU.
The era of "instant" local LLMs has arrived: 35B-parameter models are now hitting speeds previously reserved for 3B-class models, which changes the latency expectations for developer tools.
- DFlash speculative decoding is the primary speed driver: a parallel block diffusion model drafts multiple tokens at once, avoiding the token-by-token bottleneck of traditional autoregressive drafting.
- At 400+ tok/s, this setup is a good fit for "Claude Code" and other agentic loops, where prompt ingestion and rapid-fire token generation determine the "feel" of the developer experience.
- FP8 weights and an FP8 KV cache are critical for maximizing the H100's throughput, suggesting that native FP8 hardware support is now the baseline for high-performance hosting.
- Qwen3.6's MoE architecture (3B active parameters) hits the "Goldilocks zone" for H100 memory bandwidth, providing high-tier reasoning without the compute overhead of dense 30B+ models.
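The draft-then-verify idea behind the first point can be sketched with toy models. This is a minimal illustration of greedy speculative decoding in general, not DFlash or SGLang code: the names `speculative_step`, `generate`, `toy_target`, and `toy_drafter` are hypothetical, the "models" are plain Python functions over integer tokens, and the block drafter is just a stand-in for a diffusion-style drafter that proposes a whole block at once.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]              # target: context -> next token (greedy)
DraftBlock = Callable[[List[Token], int], List[Token]]  # drafter: context, k -> k proposed tokens


def speculative_step(context: List[Token], target_next: NextToken,
                     draft_block: DraftBlock, block_size: int) -> List[Token]:
    """Draft a block, keep the longest prefix the target agrees with,
    then append one token from the target itself.

    A real server verifies every drafted position in one batched forward
    pass; the target is called position by position here only for clarity.
    """
    proposal = draft_block(context, block_size)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # whole block accepted: one bonus token
    return accepted


def generate(prompt: List[Token], target_next: NextToken, draft_block: DraftBlock,
             block_size: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new_tokens are produced.
    Returns the full sequence and the number of verification steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft_block, block_size))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Toy target (assumption for illustration): always counts upward modulo 10.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


# Toy drafter (assumption for illustration): counts upward too,
# but always gets the last token of its block wrong.
def toy_drafter(ctx: List[Token], k: int) -> List[Token]:
    block, last = [], ctx[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(last)
    block[-1] = (block[-1] + 5) % 10
    return block


spec_out, spec_steps = generate([0], toy_target, toy_drafter, block_size=4, max_new_tokens=8)

# Baseline: plain autoregressive decoding, one target call per token.
baseline = [0]
for _ in range(8):
    baseline.append(toy_target(baseline))

print(spec_out == baseline, spec_steps)
```

In this toy run the speculative output matches plain greedy decoding token for token while needing fewer verification steps than generated tokens. That is where the speedup comes from: the more drafted tokens the target accepts per step, the fewer sequential target passes are needed.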
DISCOVERED: 2026-04-28
PUBLISHED: 2026-04-28
AUTHOR: Asleep_Training3543