OPEN_SOURCE
REDDIT // 3h ago · INFRASTRUCTURE
Qwen3.6-35B-A3B hits 400 tok/s on H100
A high-performance SGLang setup for Qwen3.6-35B-A3B achieves record-breaking speeds on a single NVIDIA H100 by combining DFlash parallel speculative decoding with FP8 precision. The setup sustains over 400 tokens per second on code generation, fast enough for real-time agentic workflows.
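The internals of DFlash are not spelled out in the post; as a generic illustration only, the draft-then-verify loop that all speculative decoding variants share can be sketched with toy stand-in models (every function here is hypothetical — DFlash reportedly drafts the whole block in parallel with a diffusion model, whereas this toy drafts serially for readability):

```python
import random

random.seed(0)

# Toy "target model": the correct next token is (prev + 1) % 100.
def target_next(token):
    return (token + 1) % 100

# Toy "draft model": usually agrees with the target, sometimes guesses
# wrong. Stand-in for DFlash's block drafter (hypothetical).
def draft_block(token, k):
    block = []
    for _ in range(k):
        token = target_next(token) if random.random() < 0.8 else 0
        block.append(token)
    return block

def speculative_decode(start, n_tokens, k=8):
    """Draft k tokens, verify them against the target model in one
    parallel pass, and keep the longest correct prefix. On the first
    mismatch, emit the target's own token and re-draft from there."""
    out = [start]
    while len(out) < n_tokens:
        block = draft_block(out[-1], k)
        prev = out[-1]
        for t in block:
            if t == target_next(prev):        # accepted draft token
                out.append(t)
                prev = t
            else:                              # rejected: take target's token
                out.append(target_next(prev))
                break
    return out[:n_tokens]

print(speculative_decode(0, 12))  # → [0, 1, 2, ..., 11]
```

The key property shown here is that output is always identical to plain autoregressive decoding; the draft model only changes how many target-model forward passes are needed, never what is generated.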
// ANALYSIS
The era of "instant" local LLMs has arrived—35B models are now hitting speeds previously reserved for tiny 3B models, fundamentally changing the latency expectations for developer tools.
- DFlash speculative decoding is the primary speed driver, using a parallel block diffusion model to predict multiple tokens at once rather than the serial bottleneck of traditional autoregressive drafting.
- Reaching 400+ tok/s makes this setup a perfect fit for "Claude Code" and other agentic loops where prompt ingestion and rapid-fire token generation determine the "feel" of the developer experience.
- The use of FP8 weights and KV cache is critical for maximizing the H100's throughput, proving that native FP8 hardware support is now the baseline for high-performance hosting.
- Qwen3.6's MoE architecture (3B active parameters) hits the "Goldilocks zone" for H100 memory bandwidth, providing high-tier reasoning without the compute overhead of dense 30B+ models.
// TAGS
qwen3.6-35b-a3b · llm · sglang · h100 · speculative-decoding · inference · open-weights
DISCOVERED
3h ago (2026-04-28)
PUBLISHED
5h ago (2026-04-28)
RELEVANCE
8/10
AUTHOR
Asleep_Training3543