OPEN_SOURCE
REDDIT // 8d ago · BENCHMARK RESULT
Gemma 4 26B-A4B outruns 31B on M1 Max
A Reddit user reports that Gemma 4 26B-A4B, quantized to Q5_K_S and run in LM Studio on an Apple M1 Max with 32GB unified memory, reaches about 50 tokens per second at a 65,536-token context while using roughly 22GB of memory. In the same setup, Gemma 4 31B Q4_K_S only manages around 10 to 11 tokens per second, making the MoE 26B-A4B variant look much better suited for fast local inference on Apple Silicon.
// ANALYSIS
Hot take: this is the kind of result that makes the MoE design matter in practice, not just on paper.
- The reported gap is large enough to make the 26B-A4B the more compelling local model for M1 Max-class machines.
- The 26B-A4B’s lower active-parameter load is consistent with the speedup; the 31B dense model is simply doing more work per token.
- At 65K context, 22GB of memory usage is practical on a 32GB system, which is the real constraint for many local users.
- This is a single-user benchmark, so treat it as anecdotal, but it lines up with Gemma 4’s intended positioning for efficient workstation inference.
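The active-parameter argument above can be sanity-checked with a back-of-envelope sketch: decode on Apple Silicon is typically memory-bandwidth bound, so tokens/sec is roughly bandwidth divided by bytes of weights touched per token. The bandwidth figure, bits-per-weight estimates for the K-quants, and the 4B active-parameter count are assumptions for illustration, not measurements from the post.

```python
# Rough upper-bound estimate: tokens/sec ~ memory bandwidth / bytes read per token.
# All constants are assumptions for illustration (nominal M1 Max bandwidth,
# approximate bits/weight for Q5_K_S and Q4_K_S, ~4B active params for the MoE).

def bytes_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """Approximate bytes of weights read to generate one token."""
    return active_params_billion * 1e9 * bits_per_weight / 8

M1_MAX_BANDWIDTH = 400e9  # bytes/sec, nominal spec for M1 Max

# MoE 26B-A4B: only ~4B params active per token, Q5_K_S at roughly 5.5 bits/weight
moe_tps = M1_MAX_BANDWIDTH / bytes_per_token(4, 5.5)

# Dense 31B: all 31B params touched per token, Q4_K_S at roughly 4.5 bits/weight
dense_tps = M1_MAX_BANDWIDTH / bytes_per_token(31, 4.5)

print(f"MoE upper bound:   {moe_tps:.0f} tok/s")
print(f"Dense upper bound: {dense_tps:.0f} tok/s")
print(f"Ratio: {moe_tps / dense_tps:.1f}x")
```

These are ceilings, not predictions, but the ~6x theoretical ratio is in the same neighborhood as the ~5x gap (50 vs 10–11 tok/s) the user reports, which is why the "less work per token" explanation holds up.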
// TAGS
gemma · gemma-4 · local-llm · apple-silicon · m1-max · moe · quantization · lm-studio · benchmark
DISCOVERED
2026-04-04
PUBLISHED
2026-04-04
RELEVANCE
8/10
AUTHOR
Beamsters