OPEN_SOURCE
REDDIT // TUTORIAL · 8d ago
Gemma 4 MoE lands single-box NVFP4 serving
This is a hands-on deployment write-up on running Google’s Gemma 4 26B-A4B MoE model efficiently on a single NVIDIA DGX Spark. The core contribution is a custom NVFP4 quantization flow that unfuses Gemma 4’s MoE experts before quantizing them, plus a small vLLM patch to load the resulting checkpoint correctly. The result reportedly fits in 16.5 GB and serves at roughly 45-60 tok/s with 256K context support, provided vLLM is launched with the right MoE backend and queried through the chat endpoint.
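The unfuse-before-quantize idea can be sketched roughly as follows. This is a minimal illustration, not the post's actual script: the tensor layout, key names (`mlp.experts.<i>.up_proj.weight`), and dimensions are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical fused MoE layout: all experts' up-projection weights
# stacked along the first axis, shape [num_experts, hidden, intermediate].
num_experts, hidden, intermediate = 4, 8, 16
fused = np.random.randn(num_experts, hidden, intermediate).astype(np.float32)

# "Unfuse": emit one tensor per expert, so a per-tensor quantizer
# (e.g. an NVFP4 flow) quantizes each expert individually instead of
# skipping the fused tensor or sharing one scale across all experts.
unfused = {
    f"mlp.experts.{i}.up_proj.weight": fused[i]  # hypothetical key schema
    for i in range(num_experts)
}

for key, weight in unfused.items():
    print(key, weight.shape)
```

The point is only the shape of the transformation: one fused tensor in, one per-expert tensor (and later one per-expert quantization scale) out.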
// ANALYSIS
Strong niche tutorial with real operational value for people trying to run large MoE models locally.
- The useful part is not just the benchmark claim, but the exact failure modes: skipped expert quantization, incorrect MoE scale-key mapping, and the need for `--moe-backend marlin`.
- This reads like an enabling post for a very specific hardware/software stack, so it is most relevant to local-LLM practitioners rather than a broad audience.
- The write-up also clarifies an easy-to-miss serving gotcha: use chat completions, not raw completions, or you can end up debugging repetition artifacts that are really prompt/endpoint misuse.
- The Product Hunt surface area seems to be Google Gemma 4 generally, but this post is specifically about a community checkpoint and serving patch around that model family.
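The scale-key mismatch called out above is, at heart, a checkpoint-key renaming problem: the quantizer writes scales under one naming scheme and the loader looks them up under another. A minimal sketch of that kind of remap, with fully hypothetical key patterns (these are not vLLM's or the post's actual names):

```python
import re

# Hypothetical: the quantizer emits per-expert scale keys like this,
# while the serving engine's loader expects a different scheme.
quantized_keys = {
    "mlp.experts.0.up_proj.weight_scale": 0.11,
    "mlp.experts.1.up_proj.weight_scale": 0.07,
}

def remap_scale_key(key: str) -> str:
    # Rewrite "experts.<i>.up_proj.weight_scale" into an assumed
    # loader-side "experts.w13_weight_scale.<i>" form (illustrative only).
    m = re.fullmatch(r"(.*)\.experts\.(\d+)\.up_proj\.weight_scale", key)
    if m is None:
        return key  # non-scale keys pass through unchanged
    prefix, idx = m.groups()
    return f"{prefix}.experts.w13_weight_scale.{idx}"

remapped = {remap_scale_key(k): v for k, v in quantized_keys.items()}
print(remapped)
```

If the mapping is wrong, the loader typically either errors on missing keys or silently applies default scales, which matches the "incorrect MoE scale-key mapping" failure mode the post describes.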
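The chat-versus-raw-completions gotcha comes down to which endpoint applies the model's chat template. A sketch of the two request shapes against an OpenAI-compatible vLLM server; the base URL and served model name are assumptions:

```python
import json

# Assumption: a local OpenAI-compatible server at the default vLLM address.
BASE = "http://localhost:8000"
MODEL = "gemma-4-26b-a4b-nvfp4"  # hypothetical served model name

# Correct: the chat endpoint wraps messages in the model's chat template
# server-side before generation.
chat_request = {
    "url": f"{BASE}/v1/chat/completions",
    "body": {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize NVFP4 in one line."}],
    },
}

# Risky: the raw completions endpoint sends the bare prompt with no chat
# template, which for instruction-tuned models often surfaces as the
# repetition artifacts the post warns about.
raw_request = {
    "url": f"{BASE}/v1/completions",
    "body": {"model": MODEL, "prompt": "Summarize NVFP4 in one line."},
}

print(json.dumps(chat_request["body"], indent=2))
```

The fix is not a sampling tweak but simply using the first request shape instead of the second.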
// TAGS
gemma 4 · moe · nvfp4 · vllm · dgx spark · quantization · local llm · nvidia · blackwell · hugging face
DISCOVERED
2026-04-03
PUBLISHED
2026-04-03
RELEVANCE
8 / 10
AUTHOR
CoconutMario