REDDIT // 3h ago · TUTORIAL

Gemma 4 MoE hits 95 tok/s on RTX 3090

Local LLM users are reporting speeds of up to 95 tok/s with Google's Gemma-4 26B-A4B model on single RTX 3090 setups, though 160k-token context windows push the card's 24GB of VRAM to its limit.

// ANALYSIS

The Gemma-4 26B-A4B MoE architecture makes a strong case that sparse models are the way forward for consumer-grade inference: a 26B-class model stays fast on a single 24GB card, provided memory management is carefully tuned.

  • Sparse Activation Efficiency: Nearly 100 tok/s from a 26B-class model shows what a ~4B active-parameter design can do on 30-series hardware (rough bandwidth math in the first sketch after this list).
  • The Context Memory Wall: Attempting 160k context on a single 3090 is "living on the edge," requiring 4-bit KV cache quantization and Hadamard transforms to maintain stability (see the KV-cache sizing sketch after this list).
  • Tooling Specialization: The reliance on ik_llama.cpp-specific flags like `-khad` and `-muge` (Unified Gating) indicates that generic inference engines are no longer sufficient for state-of-the-art MoE models.
  • Stability over Speed: A growing segment of the community is willing to trade half its throughput for "rock-solid" long-context reliability, signaling a shift from benchmark chasing to workflow integration.
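
For a sense of where that throughput comes from, here is a rough memory-bandwidth-only sketch in Python. The ~4B active-parameter figure comes from the model name; the bandwidth, bits-per-weight, and efficiency values are assumptions, so read the output as an order-of-magnitude ceiling rather than a prediction.

    # Decode is roughly memory-bandwidth bound: each generated token must read
    # every *active* weight once. All numbers below are assumed, not measured.
    active_params = 4e9       # "A4B" -> ~4B parameters active per token
    total_params = 26e9       # full 26B-class parameter count
    bits_per_weight = 4.5     # typical 4-bit quant incl. group scales (assumed)
    bandwidth = 936e9         # RTX 3090 peak memory bandwidth, bytes/s
    efficiency = 0.5          # usable fraction of peak in practice (assumed)

    def decode_ceiling(params: float) -> float:
        bytes_per_token = params * bits_per_weight / 8
        return bandwidth * efficiency / bytes_per_token

    print(f"MoE (~4B active) ceiling: {decode_ceiling(active_params):.0f} tok/s")
    print(f"Dense 26B ceiling:        {decode_ceiling(total_params):.0f} tok/s")

Under these assumptions the sparse model's ceiling sits near 200 tok/s while an equivalent dense 26B model would cap out around 30 tok/s, so a real-world 95 tok/s, after attention, KV-cache reads, and routing overhead, is plausible for the MoE design and out of reach for a dense one.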
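
The "living on the edge" claim is easy to sanity-check with a KV-cache sizing sketch. The post does not state Gemma-4's layer count, KV-head count, or head dimension, so the architecture numbers below are placeholders chosen only to show the scaling, not published specs.

    # KV-cache size vs. cache precision for a long context window.
    # Layer/head/dim values are placeholders, NOT real Gemma-4 parameters.
    layers = 48          # hypothetical transformer layer count
    kv_heads = 8         # hypothetical KV heads (assumes grouped-query attention)
    head_dim = 128       # hypothetical per-head dimension
    ctx = 160_000        # context length reported in the post

    def kv_cache_gb(bits_per_value: int) -> float:
        # 2x for keys and values; one vector per layer, head, and position.
        values = 2 * layers * kv_heads * head_dim * ctx
        return values * bits_per_value / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit KV cache at 160k ctx: {kv_cache_gb(bits):5.1f} GB")

With these placeholder numbers an FP16 cache alone (~31 GB) would never fit in 24GB, while a 4-bit cache (~8 GB) leaves just enough room next to roughly 15 GB of 4-bit weights, which matches the "living on the edge" framing. The Hadamard transform mentioned above is commonly used to spread activation outliers before aggressive low-bit quantization, which is why it appears alongside the 4-bit KV cache.
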
// TAGS
gemma-4 · llm · moe · inference · gpu · self-hosted · ik-llama-cpp

DISCOVERED
3h ago · 2026-04-26

PUBLISHED
6h ago · 2026-04-26

RELEVANCE
8/10

AUTHOR
Deadhookersandblow