OPEN_SOURCE
REDDIT // 8d ago // BENCHMARK RESULT

Gemma 4 MoE hits 120 TPS on dual 3090s

A Reddit benchmark claims Gemma 4’s MoE variant reaches roughly 120 tokens per second on dual RTX 3090s. That fits Google’s positioning: the 26B MoE only activates 3.8B parameters per token and is built for fast local inference.
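
As a rough sanity check (not taken from the Reddit post), single-stream decode is usually memory-bandwidth bound, so the active parameter count alone gives a back-of-envelope throughput ceiling. Every constant below is an illustrative assumption: the quantization overhead, the achieved-bandwidth fraction, and the decision to ignore KV cache and multi-GPU overhead.

```python
# Rough, bandwidth-bound estimate of single-stream decode speed for an MoE
# model that activates ~3.8B parameters per token. All constants are
# illustrative assumptions, not measurements from the Reddit benchmark.

ACTIVE_PARAMS = 3.8e9      # parameters touched per generated token (per the MoE spec)
BYTES_PER_PARAM = 0.55     # ~4-bit quantized weights plus scales/zero-points (assumed)
GPU_BANDWIDTH = 936e9      # RTX 3090 memory bandwidth in bytes/s (spec-sheet value)
EFFICIENCY = 0.5           # fraction of peak bandwidth typically achieved (assumed)

def estimated_tps(active_params: float, bytes_per_param: float,
                  bandwidth: float, efficiency: float) -> float:
    """Tokens/s if every decode step must stream all active weights once."""
    bytes_per_token = active_params * bytes_per_param
    return (bandwidth * efficiency) / bytes_per_token

if __name__ == "__main__":
    tps = estimated_tps(ACTIVE_PARAMS, BYTES_PER_PARAM, GPU_BANDWIDTH, EFFICIENCY)
    print(f"~{tps:.0f} tokens/s per GPU (ignores KV cache, routing, multi-GPU overhead)")
```

Under those assumptions a single 3090 lands in the low hundreds of tokens per second before overheads, so a real-world 120 TPS figure across two cards sits in the plausible range rather than being an outlier.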

// ANALYSIS

This is exactly why MoE matters: it can feel like a frontier-sized model without paying the full dense-model latency tax. Still, treat this as a tuned local benchmark, not a universal speed guarantee.
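
For readers less familiar with why that works, here is a toy sketch of top-k expert routing. It is a generic MoE illustration, not Gemma 4's actual router, expert count, or dimensions: each token passes through only k of the E experts, which is what decouples total parameter count from per-token compute.

```python
# Toy top-k MoE routing (generic illustration, not Gemma 4's architecture):
# each token runs through only K of the E experts, so per-token compute
# scales with K * expert_size rather than with the full parameter count.

import numpy as np

rng = np.random.default_rng(0)

E, K = 8, 2              # total experts, experts active per token (assumed values)
d_model, d_ff = 64, 256

# One tiny MLP "expert" per slot; real models use far larger experts.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(E)]
router = rng.standard_normal((d_model, E)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through its top-K experts only."""
    logits = x @ router
    top = np.argsort(logits)[-K:]                         # indices of the K best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)    # ReLU expert MLP
    return out

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)   # (64,) -- only 2 of 8 experts were evaluated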

  • Google says the 26B MoE activates only 3.8B parameters during inference, so high TPS on strong consumer GPUs is plausible
  • Dual 3090s are serious workstation hardware, not a typical local setup, so the result should not be generalized to average rigs
  • Real throughput will swing with quantization, context length, backend, batching, and sampling settings; a minimal measurement sketch follows this list
  • For agentic workflows, shaving latency this far can materially improve the feel of iterative tool use and code generation
  • The benchmark strengthens Gemma 4’s case as a local-first open model, especially for users who can afford the VRAM
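
A minimal sketch of how a number like this is typically measured, assuming a local OpenAI-compatible server (llama.cpp's server, vLLM, or similar); the endpoint, model name, and prompt below are placeholders, not details from the post.

```python
# Minimal throughput probe against a local OpenAI-compatible server
# (llama.cpp server, vLLM, etc.). Endpoint, model name, and prompt are
# placeholders; adjust for your own setup. This times one request
# end-to-end, so prompt processing is included -- a tuned benchmark
# would stream and time only the decode phase.

import time
import requests

ENDPOINT = "http://localhost:8080/v1/completions"  # assumed local server
PAYLOAD = {
    "model": "gemma-4-moe",        # placeholder model identifier
    "prompt": "Explain mixture-of-experts routing in two sentences.",
    "max_tokens": 256,
    "temperature": 0.0,            # deterministic output for repeatable timing
}

start = time.perf_counter()
resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=300)
resp.raise_for_status()
elapsed = time.perf_counter() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s (end-to-end)")
```

Running this a few times with a fixed prompt and averaging is the minimum needed before comparing numbers across quantizations or backends.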
// TAGS
gemma-4 · llm · benchmark · inference · gpu · open-weights

DISCOVERED: 2026-04-04 (8d ago)

PUBLISHED: 2026-04-04 (8d ago)

RELEVANCE: 9/10

AUTHOR: AaZzEL