Gemma 4 26B-A4B hits 196 tok/s on 5090
OPEN_SOURCE ↗
REDDIT // 4h ago · TUTORIAL


This guide walks through running Gemma 4 26B-A4B on a single RTX 5090 with vLLM on RunPod Serverless. The working setup uses AWQ 4-bit weights, FP8 KV cache, and tool-calling flags to reach about 196 tok/s decode with 96k context.
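A launch command consistent with that description might look like the sketch below. The checkpoint name and the `gemma` tool-parser name are assumptions filled in for illustration, not identifiers taken from the post:

```shell
# Hypothetical launch sketch; checkpoint and parser names are placeholders.
#   --quantization awq_marlin : AWQ 4-bit weights via the Marlin kernels
#   --kv-cache-dtype fp8      : FP8 KV cache to stretch usable context
#   --max-model-len 98304     : ~96k context
vllm serve <awq-gemma-4-26b-a4b-checkpoint> \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 98304 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma
```

On RunPod Serverless this would be the worker's container command, with the checkpoint swapped in for whichever AWQ build the deployment actually uses.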

// ANALYSIS

This is a practical deployment report, not just a benchmark flex: the main takeaway is that on consumer Blackwell, ecosystem support matters more than theoretical peak formats. NVFP4 looks promising, but stable vLLM still blocks the Gemma 4 MoE path, so AWQ is the only usable option today.

  • Stable vLLM plus AWQ/Marlin is the real production path right now; the native FP4 route is still gated by an unmerged MoE weight-mapping fix.
  • The performance is strong for a single-GPU private endpoint: ~196 tok/s decode with warm TTFT in the 1-3s range is good enough for coding-agent workloads.
  • The post is valuable because it captures the non-obvious breakpoints: CUDA 12.9 driver filtering, the Gemma 4 tool parser, and the exact chat template all matter.
  • FP8 KV cache is the right Blackwell-specific win here, since it stretches usable context without turning the deployment into a science project.
  • For anyone trying to self-host recent MoE models, this reads like a reproducible template for the current state of the stack, not just one-off numbers.
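The FP8 KV-cache point is easy to sanity-check with a back-of-envelope sizing sketch. The layer, head, and dimension numbers below are illustrative placeholders, not Gemma 4 26B-A4B's actual config:

```python
# Back-of-envelope KV-cache sizing. All architecture numbers here are
# illustrative placeholders, NOT the real Gemma 4 26B-A4B config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # 2x accounts for the separate K and V tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ = 96 * 1024  # ~96k context, as in the post
fp16 = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                      seq_len=SEQ, bytes_per_elem=2)
fp8 = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                     seq_len=SEQ, bytes_per_elem=1)
print(f"fp16 KV: {fp16 / 2**30:.1f} GiB -> fp8 KV: {fp8 / 2**30:.1f} GiB")
```

Whatever the real architecture numbers are, FP8 halves KV bytes per token, which is why it roughly doubles the context that fits in the VRAM left over after the AWQ weights on a 32 GB card.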
// TAGS
gemma-4-26b-a4b · vllm · runpod · llm · inference · gpu · cloud · self-hosted

DISCOVERED

2026-04-19 · 4h ago

PUBLISHED

2026-04-19 · 6h ago

RELEVANCE

8 / 10

AUTHOR

sudo_ls_ads