Gemma 4 MoE lands single-box NVFP4 serving
This is a hands-on deployment write-up for getting Google’s Gemma 4 26B-A4B MoE model running efficiently on a single NVIDIA DGX Spark. The core contribution is a custom NVFP4 quantization flow that unfuses Gemma 4’s MoE experts before quantizing them, plus a small vLLM patch to load the resulting checkpoint correctly. The result is a model that reportedly fits in 16.5GB and serves at roughly 45-60 tok/s with 256K context support, using vLLM with the right MoE backend and chat endpoint setup.
Strong niche tutorial with real operational value for people trying to run large MoE models locally.
- –The useful part is not just the benchmark claim, but the exact failure modes: skipped expert quantization, incorrect MoE scale-key mapping, and the need for `--moe-backend marlin`.
- –This reads like an enabling post for a very specific hardware/software stack, so it is most relevant to local-LLM practitioners rather than a broad audience.
- –The write-up also clarifies an easy-to-miss serving gotcha: use chat completions, not raw completions, or you can end up debugging repetition artifacts that are really prompt/endpoint misuse.
- –The Product Hunt surface area seems to be Google Gemma 4 generally, but this post is specifically about a community checkpoint and serving patch around that model family.
DISCOVERED
54d ago
2026-04-03
PUBLISHED
54d ago
2026-04-03
RELEVANCE
AUTHOR
CoconutMario