OPEN_SOURCE
REDDIT // 3h ago · MODEL RELEASE
NVIDIA Gemma 4 NVFP4 lands
NVIDIA released an NVFP4-quantized Gemma 4 26B A4B checkpoint on Hugging Face, aimed at Blackwell-class inference with vLLM. The model keeps benchmark quality close to full precision while shrinking the footprint to a size that community testers say fits on a 5090 with room for long context.
// ANALYSIS
This is less about a flashy new model than about making a strong open-weight model materially easier to run locally. The real signal is that NVIDIA is pushing a deployment-ready quantized path, not just bragging about raw scores.
- The benchmark deltas are tiny: GPQA, MMLU Pro, LiveCodeBench, and IFEval all stay near full precision, while AIME even ticks up slightly.
- At 18.8GB, the checkpoint is small enough to be practical on high-end consumer GPUs, and the Reddit report of roughly 50K context on a 5090 suggests it is actually usable, not just theoretically supported.
- The model card says vLLM support is available, but also notes current MoE limitations like TP=1 only, so this is still an infrastructure story as much as a model story.
- For developers, the value is deployment economics: lower memory pressure, faster iteration, and a cleaner path to running a capable multimodal model on local or edge NVIDIA hardware.
- The release strengthens Gemma 4’s position as a serious open-model family, but the differentiator here is NVIDIA’s quantization and runtime packaging around it, not a new architecture.
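As a rough sketch of why the 18.8GB figure matters: the source gives the parameter count (26B) and checkpoint size, and a quick back-of-the-envelope calculation shows the effective bits per parameter and the VRAM headroom left for KV cache. The 32GB VRAM figure is an assumption (the standard RTX 5090 spec, not stated in the post), and the numbers are estimates, not a capacity guarantee.

```python
# Back-of-the-envelope deployment math for the NVFP4 checkpoint.
# Figures from the release: 26B total parameters, 18.8 GB checkpoint.
# ASSUMPTION: 32 GB of VRAM (RTX 5090 spec); adjust for other cards.

PARAMS = 26e9          # total parameter count (26B MoE)
CHECKPOINT_GB = 18.8   # reported checkpoint size on Hugging Face
GPU_VRAM_GB = 32.0     # assumed 5090 memory

# Effective bits per parameter: FP4 weights plus per-block scale
# factors (and any layers kept at higher precision) land above a
# flat 4 bits per parameter.
bits_per_param = CHECKPOINT_GB * 1e9 * 8 / PARAMS

# Memory left for KV cache and activations after loading weights --
# the budget that makes the reported ~50K context plausible.
headroom_gb = GPU_VRAM_GB - CHECKPOINT_GB

print(f"~{bits_per_param:.2f} effective bits/param")
print(f"~{headroom_gb:.1f} GB headroom for KV cache + activations")
```

The ~5.8 effective bits/param (versus a flat 4) is the usual overhead of block-scaled formats like NVFP4; the ~13GB of headroom is what the community's long-context reports are spending.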
// TAGS
gemma-4-26b-a4b-nvfp4 · llm · benchmark · inference · gpu · reasoning · multimodal
DISCOVERED
3h ago
2026-05-01
PUBLISHED
3h ago
2026-05-01
RELEVANCE
9/10
AUTHOR
reto-wyss