OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Gemma 4 31B tops GPQA Diamond
Google’s Gemma 4 31B dense model is drawing attention for a community benchmark claim of 85.7% on GPQA Diamond, nearly matching Qwen3.5 27B while using fewer output tokens. Google’s launch also positions it as a single-H100, 256K-context, multimodal open model family.
// ANALYSIS
The interesting part here is not just the score, but the implied efficiency curve: if the benchmark holds up, Gemma 4 is squeezing near-frontier reasoning into a much more deployable footprint.
- Google’s official launch says the 31B dense model fits on a single 80GB H100, which makes this feel less like lab bragging and more like something teams can actually run.
- The Reddit post’s token-efficiency claim is the real differentiator: similar benchmark performance with fewer output tokens suggests lower inference cost per useful answer.
- Gemma 4’s 256K context, multimodal input, and native function-calling make it more than a chat model; it’s clearly aimed at agentic workflows and local developer tooling.
- The caution flag is provenance: this specific Qwen comparison is a community benchmark claim, not an official Google benchmark, so it should be treated as promising but not definitive.
- Still, Apache 2.0 plus open weights means adoption friction is low, which is exactly what the open-model ecosystem needs right now.
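The single-H100 claim is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming bf16 weights (2 bytes per parameter) and ignoring KV cache and activation memory, which add real overhead at 256K context:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough VRAM needed for model weights alone, in decimal GB.

    Ignores KV cache, activations, and framework overhead, so this is
    a lower bound, not a full deployment estimate.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9


# 31B parameters in bf16: 62 GB of weights, leaving ~18 GB of an
# 80GB H100 for KV cache and activations.
print(weight_memory_gb(31, 2))  # → 62.0

# int8 quantization halves that to 31 GB.
print(weight_memory_gb(31, 1))  # → 31.0
```

The numbers are consistent with the launch claim: bf16 weights fit on one 80GB card with headroom, though very long contexts would likely push teams toward quantization or KV-cache offloading.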
// TAGS
gemma-4 · llm · reasoning · multimodal · open-weights · benchmark · gpu
DISCOVERED
2026-04-03
PUBLISHED
2026-04-03
RELEVANCE
10/10
AUTHOR
Pascal22_