Gemma-4-26B-A4B-it-NVFP4 runs on vLLM via community patch
REDDIT · 6d ago · OPEN-SOURCE RELEASE


Developers have successfully deployed the Gemma-4-26B-A4B-it-NVFP4 model on vLLM by applying a custom Python patch that resolves weight scale mapping issues for Mixture-of-Experts (MoE) layers. The implementation leverages the Marlin backend for optimized performance on NVIDIA Blackwell (SM 12.1) hardware, specifically utilizing NVFP4 quantization and FP8 KV cache for high-efficiency local serving. This community-driven fix provides a critical bridge for running Google's latest open-weights model on vLLM ahead of official architectural support in the main repository.
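The weight-scale mapping failure described above typically comes down to checkpoint key names not matching the fused parameter names vLLM registers for MoE layers. The sketch below shows the general shape of such a remapping; the key patterns, function name, and fused-parameter names (`w13`/`w2`) are illustrative assumptions, not the actual community patch:

```python
import re

def remap_moe_scale_key(key: str) -> str:
    """Hypothetical sketch: rewrite a per-expert, dot-separated scale key
    into a fused-MoE parameter key of the kind vLLM's loader expects."""
    # e.g. "model.layers.3.mlp.experts.7.gate_proj.weight_scale"
    #   -> "model.layers.3.mlp.experts.w13_weight_scale"
    m = re.match(
        r"(.*\.experts)\.(\d+)\.(gate_proj|up_proj|down_proj)\.(weight_scale.*)",
        key,
    )
    if m is None:
        return key  # non-MoE keys pass through unchanged
    prefix, _expert_idx, proj, suffix = m.groups()
    # gate/up projections are fused into one parameter; down stays separate
    fused = "w13" if proj in ("gate_proj", "up_proj") else "w2"
    return f"{prefix}.{fused}_{suffix}"
```

The per-expert index is dropped because the fused parameter holds the scales for all experts in one tensor; the loader then scatters each expert's slice into it.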

// ANALYSIS

This successful run demonstrates the speed at which the open-source community adapts new architectures to existing inference frameworks.

  • The patch specifically fixes expert_params_mapping errors where dot-separated scale keys failed to map correctly to fused MoE parameters in the vLLM executor.
  • Mandatory use of the Marlin backend highlights the shift toward specialized kernels for Blackwell's SM 12.1 architecture in 4-bit (NVFP4) deployments.
  • The combination of NVFP4 and FP8 KV cache represents a significant leap in memory efficiency, enabling 26B-class models to run with higher parameter density on local hardware.
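The memory claim in the last bullet can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch: real footprints also include block scale factors, activations, and runtime buffers, which are ignored here:

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB for a given per-parameter bit width."""
    return n_params * bits_per_param / 8 / 1e9

N = 26e9  # 26B total parameters
print(f"bf16  weights: ~{weight_gb(N, 16):.0f} GB")  # ~52 GB
print(f"nvfp4 weights: ~{weight_gb(N, 4):.0f} GB")   # ~13 GB

# FP8 KV cache additionally halves cache memory versus fp16 per cached token,
# leaving more headroom for context length on the same local GPU.
```

A roughly 4x reduction in weight memory is what moves a 26B-class model from datacenter territory into a single consumer Blackwell card.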
// TAGS
gemma-4-26b-a4b-it-nvfp4 · gemma-4 · vllm · nvfp4 · blackwell · marlin · moe · quantization · open-source

DISCOVERED


2026-04-06

PUBLISHED


2026-04-06

RELEVANCE

8/10

AUTHOR

NovelAdorable7033