OPEN_SOURCE
REDDIT // 9d ago // MODEL RELEASE

Gemma-4-26B-A4B hits 15 tps via Unsloth

Google's Gemma-4-26B-A4B MoE model achieves fast local inference on mid-range consumer hardware using Unsloth's MXFP4 quantization: the quantized 26B-parameter model runs at 12-15 tokens per second on an AMD RX 6600.
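For readers who want to try something similar locally, below is a minimal sketch using llama-cpp-python (the same llama.cpp engine that LM Studio wraps). The GGUF filename is hypothetical; substitute whatever quantized file Unsloth actually publishes, and note that an MXFP4-capable, Vulkan-enabled llama.cpp build is assumed for AMD cards like the RX 6600.

from llama_cpp import Llama

# Hypothetical filename: substitute the actual quantized GGUF that Unsloth
# publishes. Assumes a llama.cpp build with GPU support (e.g. Vulkan for
# AMD cards) and MXFP4-capable quant kernels.
llm = Llama(
    model_path="gemma-4-26b-a4b-mxfp4.gguf",
    n_gpu_layers=-1,   # offload as many layers to the GPU as VRAM allows
    n_ctx=8192,        # modest context; raise if memory permits
    verbose=False,
)

out = llm.create_completion(
    "Explain Mixture-of-Experts routing in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])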

// ANALYSIS

The arrival of MXFP4 quantization for the Gemma 4 family is a significant step toward making high-quality Mixture-of-Experts (MoE) models usable on budget hardware.

  • MoE architecture with 4B active parameters delivers 26B-level reasoning with the compute footprint of a much smaller model
  • MXFP4 (Microscaling FP4) maintains higher effective precision than traditional 4-bit integer quantization, preserving model intelligence (see the block-scaling sketch after this list)
  • Achieving 15 tps on a sub-$200 GPU like the RX 6600 shows that sophisticated AI is no longer gated by high-end VRAM
  • Support for 256K context windows via Unsloth optimizations enables large-scale local RAG and document analysis
  • First-class Vulkan support in LM Studio further expands hardware compatibility beyond NVIDIA users
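
To make the "microscaling" in MXFP4 concrete: in the OCP MX specification, values are grouped into blocks of 32 that share a single power-of-two (E8M0) scale, and each element is stored as a 4-bit E2M1 float. The Python sketch below illustrates the rounding scheme; it is a simplified illustration, not bit-exact with Unsloth's or llama.cpp's kernels.

import numpy as np

# Magnitudes representable by a 4-bit E2M1 element (sign stored separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one 32-value block: shared power-of-two scale + E2M1 elements.
    Simplified illustration of the MX scheme, not a production kernel."""
    assert block.size == 32
    amax = np.abs(block).max()
    if amax == 0.0:
        return 1.0, np.zeros_like(block)
    # Pick a power-of-two scale so the block maximum lands near the top of
    # the E2M1 range (whose largest magnitude is 6.0 = 1.5 * 2**2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = block / scale
    # Round each element to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return scale, np.sign(scaled) * FP4_GRID[idx]

rng = np.random.default_rng(0)
weights = rng.normal(size=32).astype(np.float32)
scale, q = mxfp4_quantize_block(weights)
print("scale:", scale, "max abs error:", np.abs(weights - scale * q).max())

Because the shared scale is a power of two, dequantization reduces to an exponent adjustment plus a 4-bit lookup, which is part of why MX formats stay cheap at inference time.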
// TAGS
gemma-4-26b-a4b · unsloth · llm · open-weights · inference · gpu · mxfp4

DISCOVERED

2026-04-03 (9d ago)

PUBLISHED

2026-04-02 (9d ago)

RELEVANCE

9 / 10

AUTHOR

mr_happy_nice