OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
Gemma 4 26B-A4B Hits 25.9 t/s
A Reddit user reports 25.9 tokens/s running Gemma 4 26B-A4B in GGUF quantization on an AMD 7840HS mini PC with a Radeon RX 9060 XT 16GB eGPU. They say it is fast enough for codebase questions through OpenCode, making this a strong real-world local-inference datapoint.
// ANALYSIS
This is a useful signal that the local LLM floor keeps rising: a MoE model that sounds huge on paper is now practical enough for coding workflows on consumer hardware. The interesting part is less the raw t/s number than that it crosses the “actually useful” threshold.
- Gemma 4 26B-A4B activates only about 3.8B parameters per token, so its speed profile is much closer to a small model's than its total parameter count suggests (a back-of-envelope check follows this list).
- The official model card positions Gemma 4 26B-A4B for GPUs and workstations, and this setup fits that story: 16GB of VRAM plus a tight quantization is enough to make it work.
- The model file alone is about 13.4GB for `UD-IQ4_NL`, so the KV cache and runtime buffers leave very little headroom on a 16GB card; that explains why `-b` and `-ub` become the first hard limit.
- `--fit`, `--fit-ctx`, and `--fit-target` are the right knobs for memory planning here; if load stability matters more than throughput, shrinking batch sizes or cache precision is a more likely win than piling on more flags (see the launch sketch after this list).
- For AI devs, the practical takeaway: an eGPU-backed mini PC can now be a credible local code-assistant box, not just a toy demo rig.
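To ground the active-parameter point, here is a rough sanity check. The inputs are assumptions for illustration, not measurements from the post: ~3.8B parameters touched per decoded token and ~4.5 bits per weight for an IQ4_NL-class quant.

```bash
# Back-of-envelope decode-bandwidth estimate (assumed values: 3.8e9 active
# params, ~4.5 bits/weight; ignores KV-cache and activation traffic, so the
# real requirement is somewhat higher).
awk 'BEGIN {
  gb_per_token = 3.8e9 * 4.5 / 8 / 1e9   # ~2.1 GB of weights read per token
  printf "GB per token: %.2f\n", gb_per_token
  printf "implied GB/s at 25.9 t/s: %.1f\n", gb_per_token * 25.9
}'
```

Roughly 55 GB/s of effective weight traffic is well within what a modern 16GB card sustains, which is why the A4B variant behaves like a ~4B dense model at decode time despite the 26B total.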
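And a minimal sketch of what a memory-conscious launch can look like on a 16GB card. It sticks to long-standing llama.cpp flags rather than the newer `--fit` family; the GGUF filename, context size, and batch values are illustrative assumptions, not settings from the post.

```bash
# Minimal sketch of a memory-conscious llama-server launch on a 16GB card.
# Filename, context, and batch sizes below are illustrative assumptions.
llama-server \
  -m ./gemma-4-26b-a4b-UD-IQ4_NL.gguf \
  -ngl 99 \
  -c 8192 \
  -b 512 -ub 256 \
  -ctk q8_0
# -ngl 99    offload every layer; ~13.4GB of weights fits on a 16GB card
# -c 8192    a modest context keeps the KV cache inside the ~2GB of headroom
# -b/-ub     shrink batch and micro-batch first if allocation fails at load
# -ctk q8_0  quantize the K cache; quantizing V too usually requires flash attention
```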
// TAGS
gemma-4-26b-a4b · llm · inference · gpu · ai-coding · benchmark
DISCOVERED: 3h ago (2026-05-01)
PUBLISHED: 5h ago (2026-05-01)
RELEVANCE: 8/10
AUTHOR: CrowKing63