OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
Gemma 4 2B-it, Qwen 3.6 Dominate H100
A Reddit benchmark on a single H100 80GB found that Gemma 4 2B-it and Qwen 3.6 35B-A3B FP8 were the best serving options among the tested small and mid-size models. The core takeaway: sparse MoE plus FP8 mattered far more than raw parameter count once concurrency and TTFT entered the picture.
// ANALYSIS
The practical lesson is simple: on one H100, architecture beats size, and dense 30B-class models are the ones that fall apart first under real load.
- Gemma 4 2B-it led the pack at 16 concurrent users with 3,180 TPS and about 55 ms TTFT, while the dense Gemma 4 31B dropped to 226 TPS and 4.1 seconds TTFT
- Qwen 3.6 35B-A3B in FP8 delivered a 73% throughput gain over BF16, a strong signal that MoE models benefit materially from lower precision on H100
- The dense Qwen 27B pair only saw about a 27% gain from FP8, suggesting dense models are less constrained by weight memory traffic than sparse MoE models, so lower precision buys them less
- Past low concurrency, dense Gemma 4 31B becomes a latency trap; it may still work for batch or offline use, but it is a poor fit for interactive chat on a single GPU
- The setup matters: vLLM 0.19.1, 100 prompts, 128 input tokens, 128 output tokens, and four concurrency levels make this a serving benchmark, not a general quality ranking
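The headline numbers above reduce to simple arithmetic over per-request timings. A minimal sketch of how aggregate TPS, mean TTFT, and the FP8-over-BF16 gain would be computed; the helper names and the sample values are illustrative, not the benchmark's actual harness or measurements:

```python
from statistics import mean

def aggregate_tps(total_output_tokens: int, wall_seconds: float) -> float:
    # Serving throughput: all generated tokens divided by wall-clock time.
    return total_output_tokens / wall_seconds

def mean_ttft_ms(ttfts_seconds: list[float]) -> float:
    # Time-to-first-token averaged across requests, reported in milliseconds.
    return mean(ttfts_seconds) * 1000

def precision_gain(fp8_tps: float, bf16_tps: float) -> float:
    # Relative throughput gain of FP8 over BF16 (0.73 == +73%).
    return fp8_tps / bf16_tps - 1

# Hypothetical run shaped like the post's setup: 100 prompts x 128 output tokens.
tokens = 100 * 128
print(round(aggregate_tps(tokens, 4.03)))        # 3176, near the reported ~3,180 TPS
print(round(precision_gain(1730.0, 1000.0), 2))  # 0.73, i.e. the reported +73%
```

Note that at high concurrency, aggregate TPS and per-request TTFT pull in opposite directions, which is why the benchmark reports both rather than a single score.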
// TAGS
benchmark · inference · gpu · llm · vllm · gemma-4 · qwen-3.6
DISCOVERED
4h ago
2026-04-25
PUBLISHED
6h ago
2026-04-25
RELEVANCE
8/10
AUTHOR
gvij