OPEN_SOURCE · INFRASTRUCTURE
REDDIT · 3h ago

Qwen3.6-35B-A3B hits 400 tok/s on H100

A high-performance SGLang setup for Qwen3.6-35B-A3B achieves record speeds on a single NVIDIA H100 by combining DFlash parallel speculative decoding with FP8 precision. With code-generation inference exceeding 400 tokens per second, the setup is fast enough for real-time agentic workflows.
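A minimal launch along these lines might look like the sketch below. The Hugging Face model id and the speculative-decoding flag value are assumptions for illustration (the post does not give the exact DFlash invocation), while `--quantization fp8` and `--kv-cache-dtype fp8_e5m2` are standard SGLang server options.

```shell
# Hypothetical launch sketch: the model id and the speculative-decoding
# algorithm value are assumed, not confirmed by the post.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --speculative-algorithm DFLASH \
  --port 30000
```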

// ANALYSIS

The era of "instant" local LLMs has arrived: 35B-class models now hit speeds previously reserved for tiny 3B models, fundamentally changing latency expectations for developer tools.

  • DFlash speculative decoding is the primary speed driver, using a parallel block diffusion model to predict multiple tokens at once rather than the serial bottleneck of traditional autoregressive drafting.
  • Reaching 400+ tok/s makes this setup a perfect fit for "Claude Code" and other agentic loops where prompt ingestion and rapid-fire token generation determine the "feel" of the developer experience.
  • The use of FP8 weights and an FP8 KV cache is critical to maximizing the H100's throughput, underscoring that native FP8 hardware support is now the baseline for high-performance hosting.
  • Qwen3.6's MoE architecture (3B active parameters) hits the "Goldilocks zone" for H100 memory bandwidth, providing high-tier reasoning without the compute overhead of dense 30B+ models.
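The draft-and-verify loop behind speculative decoding can be sketched with toy stand-in "models" (plain functions here; the post does not describe DFlash's block-diffusion drafter internals, so this shows only the generic accept-longest-matching-prefix mechanics that all speculative schemes share):

```python
def target_next(seq):
    # Toy "target model": a deterministic next-token rule.
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Toy "draft model": agrees with the target except after token 5.
    if seq[-1] == 5:
        return 0  # deliberate disagreement
    return (seq[-1] + 1) % 10

def speculative_step(seq, k):
    # 1) The cheap draft model proposes a block of k tokens.
    draft, cur = [], list(seq)
    for _ in range(k):
        t = draft_next(cur)
        draft.append(t)
        cur.append(t)
    # 2) The target scores all k positions in ONE pass (simulated here by
    #    asking what the target itself would emit at each prefix), then the
    #    longest matching prefix is accepted.
    accepted, cur = [], list(seq)
    for t in draft:
        expect = target_next(cur)
        if t != expect:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expect)
            return seq + accepted
        accepted.append(t)
        cur.append(t)
    # All k accepted; the verification pass also yields one bonus token.
    accepted.append(target_next(cur))
    return seq + accepted
```

When the draft agrees, one target forward pass commits k+1 tokens instead of one, which is where the speedup comes from; every accepted token is still exactly what the target would have produced on its own.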
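The "Goldilocks zone" claim can be sanity-checked with back-of-envelope arithmetic: during decode, each generated token must stream the active weights from HBM at least once, so memory bandwidth sets a throughput ceiling. The H100 bandwidth figure below is the approximate SXM HBM3 spec and an assumption on my part; KV-cache and activation traffic are ignored.

```python
# Decode roofline for an MoE model: count only active-weight traffic.
active_params = 3e9       # ~3B active parameters per token (the "A3B")
bytes_per_param = 1       # FP8 weights: 1 byte each
hbm_bandwidth = 3.35e12   # H100 SXM HBM3, bytes/s (approximate spec)

bytes_per_token = active_params * bytes_per_param
ceiling_tok_s = hbm_bandwidth / bytes_per_token
print(f"{ceiling_tok_s:.0f} tok/s memory-bound ceiling")  # ~1117 tok/s
```

An observed 400 tok/s sits comfortably under this ceiling, which makes the claim plausible rather than proving it; by the same arithmetic a dense 35B model in FP8 would cap out near 96 tok/s (3.35e12 / 35e9), which is the MoE advantage the bullet describes.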
// TAGS
qwen3.6-35b-a3b · llm · sglang · h100 · speculative-decoding · inference · open-weights

DISCOVERED: 3h ago (2026-04-28)

PUBLISHED: 5h ago (2026-04-28)

RELEVANCE: 8/10

AUTHOR: Asleep_Training3543