Gemma 4 26B-A4B outruns 31B on M1 Max
OPEN_SOURCE ↗
REDDIT // 8d ago · BENCHMARK RESULT

A Reddit user reports that Gemma 4 26B-A4B, quantized to Q5_K_S and run in LM Studio on an Apple M1 Max with 32GB unified memory, reaches about 50 tokens per second at a 65,536-token context while using roughly 22GB of memory. In the same setup, Gemma 4 31B Q4_K_S only manages around 10 to 11 tokens per second, making the MoE 26B-A4B variant look much better suited for fast local inference on Apple Silicon.
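The reported ~22GB footprint is the kind of number you can sanity-check with back-of-envelope arithmetic: quantized weight size plus KV cache for the context window. A minimal sketch, assuming Q5_K_S averages roughly 5.5 bits per weight (the approximate llama.cpp figure) and a purely hypothetical layer/head configuration for the KV-cache term, since Gemma 4's actual architecture details are not given in the post:

```python
# Rough memory estimate for a quantized model at long context.
# All architecture numbers below are illustrative assumptions, NOT
# the real Gemma 4 26B-A4B config.

PARAMS = 26e9            # total parameters (26B)
BITS_PER_WEIGHT = 5.5    # ~Q5_K_S average bits/weight (approximate)

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9   # bits -> bytes -> GB

# KV cache: 2 tensors (K and V) per layer, per token.
LAYERS = 48              # hypothetical
KV_HEADS = 8             # hypothetical (GQA)
HEAD_DIM = 128           # hypothetical
KV_BYTES = 1             # 8-bit KV cache; fp16 would double this
CONTEXT = 65_536

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES
kv_gb = kv_per_token * CONTEXT / 1e9

print(f"weights ≈ {weights_gb:.1f} GB, KV cache ≈ {kv_gb:.1f} GB")
```

With these assumptions the weights alone come to roughly 18GB, which shows why a 32GB machine is the practical floor for this model at long context; the exact KV-cache term depends on the real layer count, head layout, and cache precision.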

// ANALYSIS

Hot take: this is the kind of result that makes the MoE design matter in practice, not just on paper.

  • The reported gap is large enough to make the 26B-A4B the more compelling local model for M1 Max-class machines.
  • The 26B-A4B’s lower active-parameter load is consistent with the speedup; the 31B dense model is simply doing more work per token.
  • At 65K context, 22GB of memory usage is practical on a 32GB system, which is the real constraint for many local users.
  • This is a single-user benchmark, so treat it as anecdotal, but it lines up with Gemma 4’s intended positioning for efficient workstation inference.
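The active-parameter argument in the bullets above can be put in numbers. On bandwidth-bound hardware like Apple Silicon, decode speed scales roughly with the bytes of weights read per token, so the parameter ratio gives a theoretical ceiling on the MoE speedup; a quick sketch, taking the "A4B" naming to mean ~4B active parameters (an assumption, as the post does not spell this out):

```python
# Why the MoE variant is faster: per-token work scales with ACTIVE params.
dense_params = 31e9    # Gemma 4 31B: every parameter touched each token
moe_active = 4e9       # 26B-A4B: ~4B active params/token (assumed from name)

# Upper bound if decode were purely weight-bandwidth bound:
theoretical_speedup = dense_params / moe_active    # ~7.75x

# Observed from the post: ~50 tok/s vs ~10-11 tok/s:
observed_speedup = 50 / 10.5                       # ~4.8x
```

The observed ~4.8x falling short of the ~7.75x ceiling is expected: attention over a 65K context, KV-cache reads, and routing overhead are costs both models pay regardless of active parameter count.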
// TAGS
gemma · gemma-4 · local-llm · apple-silicon · m1-max · moe · quantization · lm-studio · benchmark

DISCOVERED

8d ago

2026-04-04

PUBLISHED

8d ago

2026-04-04

RELEVANCE

8/10

AUTHOR

Beamsters