Gemma 4 MoE hits 95 tok/s on RTX 3090

// 45d ago · TUTORIAL

Local LLM users report up to 95 tok/s with Google's Gemma-4 26B-A4B model on a single RTX 3090, though a 160k context window pushes the card's 24GB of VRAM to its limit.
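To make the memory pressure concrete, here is a back-of-the-envelope estimate of KV cache size at 160k tokens. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma-4's published configuration. The 4.5 bits per value reflects llama.cpp-style `q4_0` blocks (32 four-bit values plus one 16-bit scale). The sketch also assumes every layer caches the full context, which overestimates models that use sliding-window attention on some layers.

```python
def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size in GiB when every layer caches all tokens."""
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8 / 2**30

# HYPOTHETICAL model shape for illustration only (not Gemma-4's real config).
shape = dict(n_layers=30, n_kv_heads=8, head_dim=128)
ctx = 160_000

fp16 = kv_cache_gib(ctx, bits_per_value=16, **shape)
q4 = kv_cache_gib(ctx, bits_per_value=4.5, **shape)  # q4_0: 18 bytes per 32 values

print(f"fp16 KV cache: {fp16:.1f} GiB")
print(f"q4_0 KV cache: {q4:.1f} GiB")
```

With these placeholder numbers, an fp16 cache alone would take roughly 18 GiB of a 24GB card, while the 4-bit cache takes roughly 5 GiB, which is why cache quantization becomes necessary once the model weights are also resident.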

// ANALYSIS

The Gemma-4 26B-A4B results suggest that sparse MoE models with a large total parameter count are becoming a strong default for consumer-grade inference, provided memory use is carefully tuned.

  • Sparse Activation Efficiency: Reaching up to 95 tok/s on a 26B-class model shows how much the 4B active parameter design helps on 30-series hardware.
  • The Context Memory Wall: Running 160k context on a single 3090 is "living on the edge," requiring 4-bit KV cache quantization and specialized Hadamard transforms to maintain stability.
  • Tooling Specialization: The reliance on ik_llama.cpp-specific flags like `-khad` and `-muge` (Unified Gating) suggests that generic inference engines may lag behind on state-of-the-art MoE models.
  • Stability over Speed: A growing segment of the community is willing to trade 50% of their throughput for "rock-solid" long-context reliability, signaling a shift from benchmark chasing to workflow integration.
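
The pairing of Hadamard transforms with 4-bit KV cache quantization can be illustrated with a small, generic sketch. It is not ik_llama.cpp's implementation, and the per-vector quantizer and the data are assumptions chosen for illustration. The idea shown: rotating a vector with an orthonormal Hadamard transform spreads a single outlier across all entries, so a 4-bit grid wastes less of its range.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in x]

def quantize_4bit(x):
    """Symmetric round-to-nearest 4-bit quantization, one scale per vector."""
    amax = max(abs(v) for v in x)
    scale = amax / 7 if amax else 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]

def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

random.seed(0)
x = [random.gauss(0, 1) for _ in range(64)]  # synthetic stand-in for a cached key vector
x[5] = 40.0  # a single outlier channel dominates the quantization scale

err_plain = rms_error(x, quantize_4bit(x))
# Rotate, quantize, rotate back: the normalized transform is its own inverse.
err_rotated = rms_error(x, fwht(quantize_4bit(fwht(x))))

print(f"RMS error without rotation: {err_plain:.3f}")
print(f"RMS error with Hadamard rotation: {err_rotated:.3f}")
```

In this toy setup the rotated path reconstructs the vector with lower error, because without rotation the outlier forces a scale so coarse that the ordinary entries all round to zero.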
// TAGS
gemma-4, llm, moe, inference, gpu, self-hosted, ik-llama-cpp

DISCOVERED

45d ago

2026-04-26

PUBLISHED

45d ago

2026-04-26

RELEVANCE

8/10

AUTHOR

Deadhookersandblow