OPEN_SOURCE
REDDIT // MODEL RELEASE
Gemma-4-26B-A4B hits 15 tps via Unsloth
Google's Gemma-4-26B-A4B MoE model achieves impressive local inference speeds on mid-range consumer hardware using Unsloth's MXFP4 quantization. The optimization allows the 26B-parameter model to run at 12-15 tokens per second on an AMD RX 6600.
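As a rough illustration of what running such a quant locally looks like, here is a minimal sketch using llama-cpp-python (one common backend underneath tools like LM Studio). The GGUF filename and the context size are assumptions for illustration, not confirmed release artifacts.

```python
# Minimal sketch: loading a 4-bit GGUF quant for local inference.
# The filename below is hypothetical; substitute whatever MXFP4 GGUF
# Unsloth actually publishes for this release.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-26b-a4b-MXFP4.gguf",  # hypothetical filename
    n_gpu_layers=-1,  # offload all layers to the GPU (Vulkan/ROCm build)
    n_ctx=32768,      # raise toward 256K only if VRAM/RAM allows
)

out = llm("Summarize the Mixture-of-Experts idea in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```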
// ANALYSIS
The arrival of MXFP4 quantization for the Gemma 4 family marks a significant leap in making high-quality Mixture-of-Experts (MoE) models accessible on budget hardware.
- MoE architecture with 4B active parameters delivers 26B-level reasoning with the compute footprint of a much smaller model (a toy routing sketch follows this list)
- MXFP4 (Microscaling Floating Point) maintains higher precision than traditional 4-bit integer quantization, preserving model intelligence (see the block-scaling sketch below)
- Achieving 15 tps on a sub-$200 GPU like the RX 6600 proves that sophisticated AI is no longer gated by high-end VRAM
- Support for 256K context windows via Unsloth optimizations enables large-scale local RAG and document analysis
- First-class Vulkan support in LM Studio further expands hardware compatibility beyond NVIDIA users
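To make the active-versus-total parameter distinction concrete, here is a toy top-k routing sketch in NumPy. Every name and dimension (d_model, n_experts, top_k) is illustrative, not Gemma's actual architecture; the point is only that a token's forward pass runs top_k expert matrices while all n_experts matrices must be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2  # illustrative sizes, not Gemma's

# One tiny FFN "expert" per slot; all are stored, few are run per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    logits = x @ router                  # routing scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]    # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the selected experts
    # Only top_k of the n_experts matrices participate in this token's compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```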
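And a sketch of the MXFP4 idea itself, following the OCP Microscaling format (FP4 E2M1 elements sharing one power-of-two scale per 32-element block): the per-block scale is what lets 4-bit elements track each block's dynamic range better than a single tensor-wide INT4 scale. This is a simplified round-trip for intuition, not Unsloth's implementation.

```python
import numpy as np

# FP4 (E2M1) magnitudes from the OCP Microscaling spec; sign handled separately.
FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[:0:-1], FP4])  # all representable signed values

def mxfp4_roundtrip(block):
    """Quantize one 32-element block to MXFP4 codes and decode it back."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # Shared E8M0 scale: a power of two chosen so amax lands near 6.0,
    # the largest E2M1 value (6 = 1.5 * 2**2, hence the "- 2").
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(block / scale, GRID[0], GRID[-1])
    codes = np.abs(scaled[:, None] - GRID[None, :]).argmin(axis=1)
    return GRID[codes] * scale

block = np.random.default_rng(0).standard_normal(32)
err = np.abs(mxfp4_roundtrip(block) - block).max()
print(f"max round-trip error: {err:.3f}")
```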
// TAGS
gemma-4-26b-a4b · unsloth · llm · open-weights · inference · gpu · mxfp4
DISCOVERED
2026-04-03
PUBLISHED
2026-04-02
RELEVANCE
9/10
AUTHOR
mr_happy_nice