Gemma 4 2B-it, Qwen 3.6 Dominate H100

// 52d ago · BENCHMARK RESULT

A Reddit benchmark on a single H100 80GB found that Gemma 4 2B-it and Qwen 3.6 35B-A3B in FP8 were the best serving options among the small and mid-size models tested. The core takeaway: sparse MoE combined with FP8 mattered far more than raw parameter count once concurrency and time to first token (TTFT) were measured.
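One rough way to see why active parameters and precision can outweigh total size is to estimate how many gigabytes of weights a model touches to decode a single token. The sketch below is a back-of-envelope calculation, not part of the benchmark. It assumes parameter counts can be read off the model names (with "A3B" taken to mean roughly 3B active parameters), assumes BF16 for the Gemma models since the summary does not state their precision, and ignores KV cache, activations, and the fact that a batch of requests can route to many different experts at once.

```python
# Back-of-envelope weight traffic per decoded token:
# active parameters (in billions) x bytes per parameter = GB per token.
# All parameter counts below are ASSUMPTIONS inferred from the model names.

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb_per_token(active_params_billions: float, dtype: str) -> float:
    """Approximate GB of weights read to decode one token (weights only)."""
    return active_params_billions * BYTES_PER_PARAM[dtype]


# (label, assumed active parameters in billions, assumed dtype)
models = [
    ("Gemma 4 2B-it (dense)", 2.0, "bf16"),
    ("Gemma 4 31B (dense)", 31.0, "bf16"),
    ("Qwen 3.6 35B-A3B (MoE, ~3B active)", 3.0, "bf16"),
    ("Qwen 3.6 35B-A3B (MoE, ~3B active)", 3.0, "fp8"),
]

for name, active_b, dtype in models:
    gb = weight_gb_per_token(active_b, dtype)
    print(f"{name:40s} {dtype:5s} ~{gb:5.1f} GB/token")
```

Under these assumptions the dense 31B model reads roughly ten times more weight data per token than the MoE in BF16. The per-token figure for the MoE is a lower bound, because concurrent requests can activate more experts per step.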

// ANALYSIS

The practical lesson is simple: on one H100, architecture beats size, and dense 30B-class models are the ones that fall apart first under real load.

– Gemma 4 2B-it led the pack at 16 concurrent users with 3,180 tokens per second (TPS) and about 55 ms TTFT, while the dense Gemma 4 31B dropped to 226 TPS and 4.1 seconds TTFT
– Qwen 3.6 35B-A3B in FP8 delivered a 73% throughput gain over BF16, a strong signal that MoE models benefit materially from lower precision on H100
– The dense Qwen 27B pair saw only about a 27% FP8 gain, suggesting that FP8 helps the dense model less than the sparse one, plausibly because the MoE is more limited by expert weight traffic
– Beyond low concurrency, dense Gemma 4 31B becomes a latency trap; it may still work for batch or offline use, but it is a poor fit for interactive chat on a single GPU
– The setup matters: vLLM 0.19.1, 100 prompts, 128 input tokens, 128 output tokens, and four concurrency levels make this a serving benchmark, not a general quality ranking
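To put the reported numbers side by side, the short sketch below turns the figures quoted in the list above into ratios. It uses only the values from this summary (which are themselves rounded, e.g. "about 55 ms"), and the helper names are illustrative rather than anything from the original benchmark.

```python
# Figures quoted in the summary for 16 concurrent users.
gemma_2b = {"tps": 3180, "ttft_ms": 55}
gemma_31b = {"tps": 226, "ttft_ms": 4100}

# How many times more throughput the 2B model delivered.
tps_ratio = gemma_2b["tps"] / gemma_31b["tps"]
# How many times longer the 31B model took to emit its first token.
ttft_ratio = gemma_31b["ttft_ms"] / gemma_2b["ttft_ms"]


def speedup_from_gain(gain_percent: float) -> float:
    """Convert a percentage gain into a multiplier (e.g. +73% -> 1.73x)."""
    return 1 + gain_percent / 100


print(f"Gemma 4 2B-it vs 31B throughput: {tps_ratio:.1f}x")  # 14.1x
print(f"Gemma 4 31B vs 2B-it TTFT:       {ttft_ratio:.1f}x")  # 74.5x
print(f"Qwen MoE FP8 vs BF16:            {speedup_from_gain(73):.2f}x")
print(f"Qwen dense FP8 vs BF16:          {speedup_from_gain(27):.2f}x")
```

The TTFT gap is far larger than the throughput gap, which is consistent with the point that the dense 31B model suffers most on interactive latency.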

// TAGS

benchmark, inference, gpu, llm, vllm, gemma-4, qwen-3.6

DISCOVERED

52d ago

2026-04-25

PUBLISHED

53d ago

2026-04-25

RELEVANCE

8/10

AUTHOR

gvij
