OPEN_SOURCE
REDDIT // 3h ago · MODEL RELEASE
NVIDIA Gemma 4 NVFP4 lands
NVIDIA released an NVFP4-quantized Gemma 4 26B A4B checkpoint on Hugging Face, aimed at Blackwell-class inference with vLLM. The model keeps benchmark quality close to full precision while shrinking the footprint to a size that community testers say fits on a 5090 with room for long context.
// ANALYSIS
This is less about a flashy new model than about making a strong open-weight model materially easier to run locally. The real signal is that NVIDIA is pushing a deployment-ready quantized path, not just bragging about raw scores.
- The benchmark deltas are tiny: GPQA, MMLU Pro, LiveCodeBench, and IFEval all stay near full precision, while AIME even ticks up slightly.
- At 18.8GB, the checkpoint is small enough to be practical on high-end consumer GPUs, and the Reddit report of roughly 50K context on a 5090 suggests it is actually usable, not just theoretically supported.
- The model card says vLLM support is available, but also notes current MoE limitations like TP=1 only, so this is still an infrastructure story as much as a model story.
- For developers, the value is deployment economics: lower memory pressure, faster iteration, and a cleaner path to running a capable multimodal model on local or edge NVIDIA hardware.
- The release strengthens Gemma 4’s position as a serious open-model family, but the differentiator here is NVIDIA’s quantization and runtime packaging around it, not a new architecture.
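As a rough sketch of why the 18.8GB figure matters: the source gives the parameter count (26B) and checkpoint size, and a quick back-of-the-envelope calculation shows the effective bits per parameter and the VRAM headroom left for KV cache. The 32GB VRAM figure is an assumption (the standard RTX 5090 spec, not stated in the post), and the numbers are estimates, not a capacity guarantee.

```python
# Back-of-the-envelope deployment math for the NVFP4 checkpoint.
# Figures from the release: 26B total parameters, 18.8 GB checkpoint.
# ASSUMPTION: 32 GB of VRAM (RTX 5090 spec); adjust for other cards.

PARAMS = 26e9          # total parameter count (26B MoE)
CHECKPOINT_GB = 18.8   # reported checkpoint size on Hugging Face
GPU_VRAM_GB = 32.0     # assumed 5090 memory

# Effective bits per parameter: FP4 weights plus per-block scale
# factors (and any layers kept at higher precision) land above a
# flat 4 bits per parameter.
bits_per_param = CHECKPOINT_GB * 1e9 * 8 / PARAMS

# Memory left for KV cache and activations after loading weights --
# the budget that makes the reported ~50K context plausible.
headroom_gb = GPU_VRAM_GB - CHECKPOINT_GB

print(f"~{bits_per_param:.2f} effective bits/param")
print(f"~{headroom_gb:.1f} GB headroom for KV cache + activations")
```

The ~5.8 effective bits/param (versus a flat 4) is the usual overhead of block-scaled formats like NVFP4; the ~13GB of headroom is what the community's long-context reports are spending.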
// TAGS
gemma-4-26b-a4b-nvfp4 · llm · benchmark · inference · gpu · reasoning · multimodal
DISCOVERED
3h ago
2026-05-01
PUBLISHED
3h ago
2026-05-01
RELEVANCE
9/10
AUTHOR
reto-wyss