A4B hits 24 tok/s for Gemma 4 26B on RTX 4060
A new inference strategy for Mixture-of-Experts (MoE) models such as Gemma 4 26B enables fast local deployment on consumer GPUs with limited VRAM. The A4B project keeps the attention layers on the GPU and offloads inactive experts to system RAM, exploiting MoE sparsity to reach 24 tok/s on an RTX 4060.
Because only a few experts run for each token, the technique turns the MoE architecture's large total weight size from a liability into a deployment advantage for local users.
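The offloading idea described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the A4B project's implementation: `ExpertCache`, `route_top_k`, and `simulate` are hypothetical names, random scores stand in for a learned router, and the expert count, `top_k`, and number of GPU slots are placeholders rather than Gemma's real configuration.

```python
import random
from collections import OrderedDict


class ExpertCache:
    """LRU cache standing in for the few VRAM slots reserved for experts.

    Hypothetical helper: experts not resident here are assumed to live in
    system RAM and must be copied over on a miss.
    """

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots
        self.resident = OrderedDict()  # expert_id -> True, oldest first
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.resident:
            # Already on the GPU: mark as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return
        # Miss: this is where a RAM -> GPU transfer would happen.
        self.misses += 1
        if len(self.resident) >= self.gpu_slots:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[expert_id] = True


def route_top_k(scores, k):
    """Return the indices of the k highest router scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def simulate(num_experts=8, top_k=2, gpu_slots=4, tokens=1000, seed=0):
    """Route `tokens` tokens and count expert-cache hits and misses."""
    rng = random.Random(seed)
    cache = ExpertCache(gpu_slots)
    for _ in range(tokens):
        # Assumption: uniform random scores replace a trained router.
        scores = [rng.random() for _ in range(num_experts)]
        for expert_id in route_top_k(scores, top_k):
            cache.fetch(expert_id)
    return cache


if __name__ == "__main__":
    cache = simulate()
    total = cache.hits + cache.misses
    print(f"expert lookups: {total}, misses (transfers): {cache.misses}")
```

Each token touches only `top_k` experts, so the miss count approximates how often weights would have to cross from system RAM to the GPU. A real router is far less uniform than the random one used here, so actual hit rates may differ considerably.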
- Exploits MoE sparsity to treat system RAM as dynamic swap space for inactive experts, significantly outperforming traditional CPU-only inference.
- Maintains high throughput (24 tok/s) on entry-level mobile GPUs, making 20B+ parameter models viable for everyday use.
- Highlights a shift in local LLM optimization: memory bandwidth between RAM and GPU becomes the new bottleneck, potentially favoring MoE over dense models for home servers.
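The bandwidth point in the last bullet can be made concrete with a back-of-envelope bound. The sketch below assumes that every weight used for a token must be read once over the slowest memory link, so throughput cannot exceed bandwidth divided by bytes read per token. The function name `max_tokens_per_second` is hypothetical, and the parameter counts, 4-bit quantization, and 50 GB/s link speed are illustrative placeholders, not measurements from the A4B project.

```python
def max_tokens_per_second(params_read_per_token, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed when weight reads dominate.

    Assumes each weight used for a token is read exactly once over the
    link whose bandwidth is given; compute time and caching are ignored.
    """
    if params_read_per_token <= 0 or bytes_per_param <= 0:
        raise ValueError("parameter count and bytes per parameter must be positive")
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # Illustrative assumptions only: 4-bit weights and a 50 GB/s link.
    bytes_per_param = 0.5
    bandwidth = 50e9

    # Hypothetical MoE that reads ~4B parameters per token.
    moe_bound = max_tokens_per_second(4e9, bytes_per_param, bandwidth)
    # Hypothetical dense model that reads all 26B parameters per token.
    dense_bound = max_tokens_per_second(26e9, bytes_per_param, bandwidth)

    print(f"MoE bound:   {moe_bound:.1f} tok/s")
    print(f"Dense bound: {dense_bound:.1f} tok/s")
```

Under this simplification, the ratio between the two bounds equals the ratio of bytes read per token, which is why a sparse model may tolerate a slow RAM-to-GPU link better than a dense model of the same total size.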
DISCOVERED
45d ago
2026-04-15
PUBLISHED
45d ago
2026-04-14
AUTHOR
Initial_Mousse_8713