OPEN_SOURCE
REDDIT · TUTORIAL
Gemma 4 26B-A4B hits 196 tok/s on 5090
This guide walks through running Gemma 4 26B-A4B on a single RTX 5090 with vLLM on RunPod Serverless. The working setup uses AWQ 4-bit weights, FP8 KV cache, and tool-calling flags to reach about 196 tok/s decode with 96k context.
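The post does not reproduce the author's exact launch command, but a minimal sketch of the described setup with vLLM's OpenAI-compatible server might look like the following. The model ID and the `--tool-call-parser` value are assumptions; the remaining flags are standard vLLM options matching what the post describes (AWQ 4-bit weights, FP8 KV cache, ~96k context, tool calling enabled):

```shell
# Hedged sketch, not the author's verified command.
# Model ID and tool-parser name are guesses; other flags are
# standard vLLM options matching the setup described in the post.
vllm serve google/gemma-4-26b-a4b-awq \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 96000 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma
```

Since vLLM exposes an OpenAI-compatible endpoint under `/v1`, any standard OpenAI client can then be pointed at the RunPod endpoint URL.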
// ANALYSIS
This is a practical deployment report, not just a benchmark flex: the main takeaway is that on consumer Blackwell, ecosystem support matters more than theoretical peak formats. NVFP4 looks promising, but stable vLLM still blocks the Gemma 4 MoE path, so AWQ is the only usable option today.
- Stable vLLM plus AWQ/Marlin is the real production path right now; the native FP4 route is still gated by an unmerged MoE weight-mapping fix.
- The performance is strong for a single-GPU private endpoint: ~196 tok/s decode with warm TTFT in the 1–3 s range is good enough for coding-agent workloads.
- The post is valuable because it captures the non-obvious breakpoints: CUDA 12.9 driver filtering, the Gemma 4 tool parser, and the exact chat template all matter.
- FP8 KV cache is the right Blackwell-specific win here, since it stretches usable context without turning the deployment into a science project.
- For anyone trying to self-host recent MoE models, this reads like a reproducible template for the current state of the stack, not just one-off numbers.
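The FP8 KV cache point can be made concrete with some back-of-the-envelope arithmetic. The layer/head/dim numbers below are hypothetical placeholders (the post does not give the Gemma 4 26B-A4B architecture details); the point is only that FP8 halves per-token KV bytes versus FP16, which is what stretches a 96k context onto a single card:

```python
# Hedged sketch: KV-cache memory per token, FP16 vs FP8.
# The layer/head/dim values are HYPOTHETICAL, not the real
# Gemma 4 26B-A4B config, which the post does not provide.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

layers, kv_heads, head_dim = 48, 8, 128  # assumed values
fp16 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2)
fp8 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 1)

ctx = 96_000  # ~96k context, as in the post
print(f"FP16 KV cache @96k: {fp16 * ctx / 2**30:.1f} GiB")
print(f"FP8  KV cache @96k: {fp8 * ctx / 2**30:.1f} GiB")
```

Whatever the real architecture numbers are, the ratio is the takeaway: FP8 doubles the context that fits in the same KV budget.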
// TAGS
gemma-4-26b-a4b · vllm · runpod · llm-inference · gpu-cloud · self-hosted
DISCOVERED
2026-04-19
PUBLISHED
2026-04-19
RELEVANCE
8/10
AUTHOR
sudo_ls_ads