OPEN_SOURCE
REDDIT · 8d ago · BENCHMARK RESULT
Gemma 4 MoE hits 120 TPS on dual 3090s
A Reddit benchmark claims Gemma 4’s MoE variant reaches roughly 120 tokens per second on dual RTX 3090s. That fits Google’s positioning: the 26B MoE only activates 3.8B parameters per token and is built for fast local inference.
// ANALYSIS
This is exactly why MoE matters: it can feel like a frontier-sized model without paying the full dense-model latency tax. Still, treat this as a tuned local benchmark, not a universal speed guarantee.
- Google says the 26B MoE activates only 3.8B parameters during inference, so high TPS on strong consumer GPUs is plausible
- A dual-3090 setup is serious workstation hardware, not a typical local rig, so the result should not be generalized to average machines
- Real throughput will swing with quantization, context length, backend, batching, and sampling settings
- For agentic workflows, cutting latency this far can materially improve the feel of iterative tool use and code generation
- The benchmark strengthens Gemma 4's case as a local-first open model, especially for users who can afford the VRAM
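One way to sanity-check the claim is a back-of-envelope estimate: single-stream decode is typically memory-bandwidth-bound, so tokens per second is roughly bandwidth divided by the bytes of active weights streamed per token. The sketch below assumes figures not in the source: an RTX 3090's ~936 GB/s peak bandwidth, 4-bit quantized weights (~0.5 bytes/param), and a crude efficiency factor for kernel and transfer overheads.

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound MoE model.
# Assumed, not from the source: RTX 3090 peak bandwidth ~936 GB/s, 4-bit
# weights (~0.5 bytes/param), and that decode mostly streams the active
# expert weights once per generated token.

def est_tps(active_params: float, bytes_per_param: float,
            bandwidth_gbs: float, efficiency: float = 0.5) -> float:
    """Rough ceiling on tokens/s: bandwidth / bytes read per token,
    scaled by a fudge factor for real-world overheads."""
    bytes_per_token = active_params * bytes_per_param
    return efficiency * bandwidth_gbs * 1e9 / bytes_per_token

# 26B MoE, 3.8B active params, 4-bit weights, one RTX 3090:
print(f"{est_tps(3.8e9, 0.5, 936):.0f} tok/s (plausibility ceiling)")
```

Under these assumptions the ceiling lands well above 120 TPS, so the Reddit number is plausible rather than suspicious; the real figure depends on the backend and settings listed above.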
// TAGS
gemma-4 · llm · benchmark · inference · gpu · open-weights
DISCOVERED
2026-04-04
PUBLISHED
2026-04-04
RELEVANCE
9/10
AUTHOR
AaZzEL