OPEN_SOURCE
REDDIT · TUTORIAL
Gemma 4 26B-A4B hits 196 tok/s on 5090
This guide walks through running Gemma 4 26B-A4B on a single RTX 5090 with vLLM on RunPod Serverless. The working setup uses AWQ 4-bit weights, FP8 KV cache, and tool-calling flags to reach about 196 tok/s decode with 96k context.
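The post does not reproduce the author's exact launch command, but a minimal sketch of the described setup with vLLM's OpenAI-compatible server might look like the following. The model ID and the `--tool-call-parser` value are assumptions; the remaining flags are standard vLLM options matching what the post describes (AWQ 4-bit weights, FP8 KV cache, ~96k context, tool calling enabled):

```shell
# Hedged sketch, not the author's verified command.
# Model ID and tool-parser name are guesses; other flags are
# standard vLLM options matching the setup described in the post.
vllm serve google/gemma-4-26b-a4b-awq \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 96000 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma
```

Since vLLM exposes an OpenAI-compatible endpoint under `/v1`, any standard OpenAI client can then be pointed at the RunPod endpoint URL.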
// ANALYSIS
This is a practical deployment report, not just a benchmark flex: the main takeaway is that on consumer Blackwell, ecosystem support matters more than theoretical peak formats. NVFP4 looks promising, but stable vLLM still blocks the Gemma 4 MoE path, so AWQ is the only usable option today.
- Stable vLLM plus AWQ/Marlin is the real production path right now; the native FP4 route is still gated by an unmerged MoE weight-mapping fix.
- The performance is strong for a single-GPU private endpoint: ~196 tok/s decode with warm TTFT in the 1–3 s range is good enough for coding-agent workloads.
- The post is valuable because it captures the non-obvious breakpoints: CUDA 12.9 driver filtering, the Gemma 4 tool parser, and the exact chat template all matter.
- FP8 KV cache is the right Blackwell-specific win here, since it stretches usable context without turning the deployment into a science project.
- For anyone trying to self-host recent MoE models, this reads like a reproducible template for the current state of the stack, not just one-off numbers.
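The FP8 KV cache point can be made concrete with some back-of-the-envelope arithmetic. The layer/head/dim numbers below are hypothetical placeholders (the post does not give the Gemma 4 26B-A4B architecture details); the point is only that FP8 halves per-token KV bytes versus FP16, which is what stretches a 96k context onto a single card:

```python
# Hedged sketch: KV-cache memory per token, FP16 vs FP8.
# The layer/head/dim values are HYPOTHETICAL, not the real
# Gemma 4 26B-A4B config, which the post does not provide.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

layers, kv_heads, head_dim = 48, 8, 128  # assumed values
fp16 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2)
fp8 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 1)

ctx = 96_000  # ~96k context, as in the post
print(f"FP16 KV cache @96k: {fp16 * ctx / 2**30:.1f} GiB")
print(f"FP8  KV cache @96k: {fp8 * ctx / 2**30:.1f} GiB")
```

Whatever the real architecture numbers are, the ratio is the takeaway: FP8 doubles the context that fits in the same KV budget.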
// TAGS
gemma-4-26b-a4b · vllm · runpod · llm-inference · gpu-cloud · self-hosted
DISCOVERED
2026-04-19
PUBLISHED
2026-04-19
RELEVANCE
8/10
AUTHOR
sudo_ls_ads