Gemma 4 2B-it, Qwen 3.6 Dominate H100

// BENCHMARK RESULT (52d ago)

A Reddit benchmark on a single H100 80GB found that Gemma 4 2B-it and Qwen 3.6 35B-A3B FP8 were the best serving options among the small and mid-size models tested. The core takeaway is that sparse MoE plus FP8 mattered far more than raw parameter count once concurrency and time to first token (TTFT) entered the picture.

// ANALYSIS

The practical lesson is simple: on one H100, architecture beats size, and dense 30B-class models are the first to fall apart under concurrent load.

  • Gemma 4 2B-it led the pack at 16 concurrent users with 3,180 tokens per second (TPS) and about 55 ms TTFT, while Gemma 4 31B dense dropped to 226 TPS and 4.1 seconds TTFT
  • Qwen 3.6 35B-A3B in FP8 delivered a 73% throughput gain over BF16, which is a strong signal that MoE models benefit materially from lower precision on H100
  • The dense Qwen 27B BF16/FP8 pair saw only about a 27% FP8 gain, suggesting that dense models benefit less from lower precision than sparse MoE models, where expert weight traffic is the more plausible bottleneck
  • Past low concurrency, dense Gemma 4 31B becomes a latency trap; it may still work for batch or offline use, but it is a poor fit for interactive chat on a single GPU
  • The setup matters: vLLM 0.19.1, 100 prompts, 128 input tokens, 128 output tokens, and four concurrency levels make this a serving benchmark, not a general quality ranking
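
For readers who want to see how figures like these are usually derived, here is a minimal sketch. It assumes TPS means total output tokens divided by wall-clock time for the run, and that TTFT is averaged per request; the original post does not state its exact definitions. The `RequestTiming` record and `summarize` helper are hypothetical names used only for illustration. The last lines compare the two Gemma figures quoted above, assuming both rows were measured at the same concurrency level.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestTiming:
    """Timing for one request (hypothetical structure, not from the benchmark)."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    output_tokens: int    # number of generated tokens


def summarize(timings: Iterable[RequestTiming]) -> dict:
    """Aggregate throughput (tokens/s) and mean TTFT (ms) over one run.

    Assumption: throughput is total output tokens over the wall-clock span
    from the earliest request start to the latest request end.
    """
    records = list(timings)
    if not records:
        raise ValueError("need at least one request")
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    if wall_s <= 0:
        raise ValueError("run must have a positive duration")
    total_tokens = sum(r.output_tokens for r in records)
    mean_ttft_ms = 1000.0 * sum(r.first_token_s - r.start_s for r in records) / len(records)
    return {"tps": total_tokens / wall_s, "mean_ttft_ms": mean_ttft_ms}


# Figures quoted in the summary above (4.1 seconds written as 4100 ms).
GEMMA_2B = {"tps": 3180.0, "ttft_ms": 55.0}
GEMMA_31B = {"tps": 226.0, "ttft_ms": 4100.0}

# How many times more throughput the 2B model delivered.
throughput_ratio = GEMMA_2B["tps"] / GEMMA_31B["tps"]
# How many times longer the 31B model took to emit its first token.
ttft_ratio = GEMMA_31B["ttft_ms"] / GEMMA_2B["ttft_ms"]
```

On the quoted numbers this works out to roughly 14x the throughput for Gemma 4 2B-it and roughly 75x the TTFT for dense Gemma 4 31B, which is the gap behind the "latency trap" remark.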
// TAGS
benchmark, inference, gpu, llm, vllm, gemma-4, qwen-3.6

DISCOVERED

52d ago

2026-04-25

PUBLISHED

53d ago

2026-04-25

RELEVANCE

8/10

AUTHOR

gvij