Gemma-4-26B-A4B-it-NVFP4 runs on vLLM via community patch
Developers have successfully deployed the Gemma-4-26B-A4B-it-NVFP4 model on vLLM by applying a custom Python patch that resolves weight scale mapping issues for Mixture-of-Experts (MoE) layers. The implementation leverages the Marlin backend for optimized performance on NVIDIA Blackwell (SM 12.1) hardware, specifically utilizing NVFP4 quantization and FP8 KV cache for high-efficiency local serving. This community-driven fix provides a critical bridge for running Google's latest open-weights model on vLLM ahead of official architectural support in the main repository.
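Under that setup, a launch might look roughly like the sketch below using vLLM's offline `LLM` API. The model ID is hypothetical, and the `modelopt_fp4` quantization value is an assumption based on how vLLM typically loads NVFP4 checkpoints; treat this as a configuration sketch, not the patch author's exact invocation.

```python
# Hypothetical configuration sketch -- model ID and flag values are
# assumptions, not confirmed by the community patch. Requires a Blackwell GPU.
from vllm import LLM

llm = LLM(
    model="google/gemma-4-26b-a4b-it-nvfp4",  # hypothetical HF model ID
    quantization="modelopt_fp4",              # assumed loader for NVFP4 checkpoints
    kv_cache_dtype="fp8",                     # FP8 KV cache, per the writeup
)
```

The FP8 KV cache halves per-token cache memory relative to FP16, which is what lets a 26B-class MoE model fit comfortably alongside 4-bit weights on a single local GPU.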
This successful run demonstrates the speed at which the open-source community adapts new architectures to existing inference frameworks.
- The patch specifically fixes expert_params_mapping errors where dot-separated scale keys failed to map correctly to fused MoE parameters in the vLLM executor.
- Mandatory use of the Marlin backend highlights the shift toward specialized kernels for Blackwell's SM 12.1 architecture in 4-bit (NVFP4) deployments.
- The combination of NVFP4 and FP8 KV cache represents a significant leap in memory efficiency, enabling 26B-class models to run with higher parameter density on local hardware.
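The expert_params_mapping failure described above can be sketched as a key-remapping problem: checkpoints store one scale tensor per expert under dot-separated names, while vLLM's fused MoE module holds a single stacked parameter per projection. A minimal illustration follows; the function name, regex, and fused-parameter naming (`w13` for gate/up, `w2` for down) are assumptions for illustration, not vLLM's actual patch code.

```python
import re

# Illustrative sketch (hypothetical names, not vLLM's real API). Checkpoint
# keys for per-expert NVFP4 weight scales arrive dot-separated, e.g.
#   "model.layers.0.mlp.experts.7.down_proj.weight_scale"
# while a fused MoE layer stores one stacked parameter per projection, e.g.
#   "model.layers.0.mlp.experts.w2_weight_scale"  (indexed by expert id).

_EXPERT_SCALE_RE = re.compile(
    r"(?P<prefix>.*\.experts)\."
    r"(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\."
    r"(?P<scale>weight_scale.*)"
)

# Assumed convention: gate/up projections fold into w13, down into w2.
_PROJ_TO_FUSED = {"gate_proj": "w13", "up_proj": "w13", "down_proj": "w2"}

def remap_expert_scale_key(ckpt_key: str):
    """Map a dot-separated per-expert scale key to (fused_param_name,
    expert_id), or return None if the key is not an expert scale."""
    m = _EXPERT_SCALE_RE.match(ckpt_key)
    if m is None:
        return None
    fused = _PROJ_TO_FUSED[m.group("proj")]
    fused_name = f"{m.group('prefix')}.{fused}_{m.group('scale')}"
    return fused_name, int(m.group("expert"))
```

For example, `remap_expert_scale_key("model.layers.0.mlp.experts.7.down_proj.weight_scale")` yields `("model.layers.0.mlp.experts.w2_weight_scale", 7)`, telling the loader which row of the stacked scale tensor to write. The reported bug is essentially this step going wrong: without the remap, the dot-separated key finds no matching parameter in the fused module and loading fails.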
DISCOVERED
2026-04-06
PUBLISHED
2026-04-06
AUTHOR
NovelAdorable7033