OPEN_SOURCE ↗
REDDIT · 3h ago · TUTORIAL
Gemma 4 MoE hits 95 tok/s on RTX 3090
Local LLM users are reporting speeds of up to 95 tok/s with Google's Gemma-4 26B-A4B model on single RTX 3090 setups, though 160k context windows are pushing the 24GB VRAM ceiling to its breaking point.
// ANALYSIS
The Gemma-4 26B-A4B MoE architecture suggests that sparse models with high total parameter counts are becoming the standard for consumer-grade inference, provided memory management is carefully tuned.
- Sparse Activation Efficiency: Achieving nearly 100 tok/s on a 26B-class model highlights the breakthrough performance of the 4B active parameter design on 30-series hardware.
- The Context Memory Wall: Attempting 160k context on a single 3090 is "living on the edge," necessitating 4-bit KV cache quantization and specialized Hadamard transforms to maintain stability.
- Tooling Specialization: The reliance on ik_llama.cpp-specific flags like `-khad` and `-muge` (Unified Gating) indicates that generic inference engines are no longer sufficient for state-of-the-art MoE models.
- Stability over Speed: A growing segment of the community is willing to trade 50% of their throughput for "rock-solid" long-context reliability, signaling a shift from benchmark chasing to workflow integration.
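The memory-wall claim follows from simple arithmetic: even though only 4B parameters are active per token, all 26B must be resident in VRAM, so the weights alone eat most of a 24 GB card before any KV cache is allocated. A back-of-envelope sketch (the ~4.5 bits-per-weight figure is an assumption for a typical 4-bit quant with scale overhead, not from the post):

```python
# Rough VRAM budget for a 26B-A4B MoE on a 24 GB RTX 3090.
# Assumptions (not from the post): ~4.5 effective bits per weight
# after quantization overhead; all experts resident in VRAM.

TOTAL_PARAMS = 26e9   # every expert must fit in memory
ACTIVE_PARAMS = 4e9   # compute per token touches only these

def weight_vram_gib(params: float, bits_per_weight: float) -> float:
    """VRAM consumed by the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30

weights = weight_vram_gib(TOTAL_PARAMS, 4.5)
headroom = 24.0 - weights
print(f"weights ≈ {weights:.1f} GiB, ≈ {headroom:.1f} GiB left for KV cache")
```

With only ~10 GiB of headroom, a 160k-token KV cache at 16-bit precision does not fit, which is why the 4-bit KV quantization in the bullet above is a necessity rather than an optimization.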
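The Hadamard-plus-4-bit-KV trick mentioned above can be sketched in a few lines. The idea: a single outlier channel inflates the per-vector quantization scale, wrecking 4-bit precision; rotating the vector by an orthonormal Hadamard matrix spreads that outlier's energy across all dimensions before quantizing, and the rotation is exactly undone on dequantization. This is a minimal NumPy illustration of the principle, not ik_llama.cpp's actual kernel:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two.
    Normalized so H is orthonormal (H @ H.T == I)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def q4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Symmetric 4-bit quantize/dequantize, one scale per vector."""
    scale = max(np.abs(x).max() / 7.0, 1e-12)
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
d = 128
k = rng.standard_normal(d)
k[3] = 40.0  # one outlier channel dominates the quantization scale

H = hadamard(d)
plain_err = np.abs(k - q4_roundtrip(k)).mean()
rot_err = np.abs(k - H.T @ q4_roundtrip(H @ k)).mean()
print(f"mean abs error  plain: {plain_err:.3f}  hadamard: {rot_err:.3f}")
```

Because the rotation is orthonormal, it adds no information loss of its own; the win comes purely from flattening the value distribution seen by the 4-bit quantizer.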
// TAGS
gemma-4 · llm · moe · inference · gpu · self-hosted · ik-llama-cpp
DISCOVERED
3h ago
2026-04-26
PUBLISHED
6h ago
2026-04-26
RELEVANCE
8/10
AUTHOR
Deadhookersandblow