Gemma 4 26B-A4B Hits 25.9 t/s
OPEN_SOURCE
REDDIT · 3h ago · BENCHMARK RESULT

A Reddit user reports 25.9 tokens/s running Gemma 4 26B-A4B as a GGUF quant on an AMD 7840HS mini PC with a Radeon RX 9060 XT 16GB eGPU. The poster says it is fast enough for codebase questions through OpenCode, making this a strong real-world local-inference data point.

// ANALYSIS

This is a useful signal that the local-LLM floor keeps rising: a MoE model that sounds huge on paper is now practical for coding workflows on consumer hardware. The interesting part is less the raw t/s number than the fact that it crosses the “actually useful” threshold.
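One way to sanity-check the number: single-stream decode is usually memory-bandwidth-bound, so a rough ceiling is effective bandwidth divided by the bytes of active weights read per token. A minimal sketch, where the ~0.53 bytes/param (IQ4-class quant), ~320 GB/s bandwidth, and 60% efficiency figures are all illustrative assumptions, not measurements:

```python
# Bandwidth-bound decode ceiling for a MoE model: each generated token
# streams every *active* weight once, so t/s <= bandwidth / active bytes.
# All constants below are assumed round numbers, not measured values.

def decode_tps_ceiling(active_params_b: float, bytes_per_param: float,
                       bandwidth_gbs: float, efficiency: float = 0.6) -> float:
    active_gb = active_params_b * bytes_per_param  # GB of weights read per token
    return bandwidth_gbs * efficiency / active_gb

# 3.8B active params (from the post), ~4.25-bit quant, ~320 GB/s VRAM bandwidth.
print(f"{decode_tps_ceiling(3.8, 0.53, 320):.0f} t/s ceiling")  # → 95 t/s ceiling
```

The observed 25.9 t/s sits well under that ceiling, which is consistent with KV-cache traffic, the eGPU link, and runtime overhead eating into effective bandwidth.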

  • Gemma 4 26B-A4B has only 3.8B active parameters at runtime, so its speed profile is much closer to a smaller model than its total parameter count suggests.
  • The official model card positions Gemma 4 26B-A4B for GPUs and workstations, and this setup fits that story: 16GB VRAM plus a tight quantization is enough to make it work.
  • The model file alone is about 13.4GB for `UD-IQ4_NL`, so KV cache and runtime buffers leave very little headroom on a 16GB card; that explains why `-b` and `-ub` become the first hard limit.
  • `--fit`, `--fit-ctx`, and `--fit-target` are the right knobs for memory planning here; if load stability matters more than throughput, shrinking batch sizes or cache precision is a more reliable win than adding more flags.
  • For AI devs, the important takeaway is practical: an eGPU-backed mini PC can now be a credible local code assistant box, not just a toy demo rig.
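The headroom point above can be put in numbers. A back-of-envelope sketch, assuming a hypothetical architecture (48 layers, 8 KV heads, head dim 128 — not the model's published config) and an FP16 cache; only the 13.4 GB file size and 16 GB VRAM figures come from the post:

```python
# Back-of-envelope VRAM budget on a 16 GB card. Model size is from the
# post; the KV-cache shape below is an assumed configuration.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V are each shaped [n_layers, ctx, n_kv_heads, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

VRAM_GB, MODEL_GB = 16.0, 13.4
kv = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, ctx=16384)
headroom = VRAM_GB - MODEL_GB - kv
print(f"kv={kv:.2f} GB, headroom={headroom:.2f} GB")  # → kv=3.22 GB, headroom=-0.62 GB
```

Under these assumptions a 16k FP16 cache alone overshoots the card, which is exactly why shrinking context, batch sizes (`-b`, `-ub`), or cache precision becomes mandatory rather than optional.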
// TAGS
gemma-4-26b-a4b · llm · inference · gpu · ai-coding · benchmark

DISCOVERED

3h ago

2026-05-01

PUBLISHED

5h ago

2026-05-01

RELEVANCE

8/10

AUTHOR

CrowKing63