OPEN_SOURCE
REDDIT // 6d ago · BENCHMARK RESULT
Gemma-4-E2B-it NVFP4 nabs Spark wins
A community NVFP4 quantization of Google’s Gemma-4-E2B-it hits about 90 tok/s on a single DGX Spark and lands nine first-place finishes in Spark Arena-style benchmark runs. It looks especially strong at deeper context and higher concurrency, where its tiny KV footprint helps it stay competitive.
// ANALYSIS
Impressive serving efficiency, but this is a throughput story more than a quality story: the model is winning on hardware fit, quantization, and context management as much as raw model capability.
- Single-user decode is already strong at ~89 tok/s, while prompt processing peaks at 8,765 tok/s at 4K context depth
- The PLE architecture and the 512-token sliding window on most layers keep KV-cache growth in check, which is why the model scales better under concurrency
- The only model ahead on single-user token generation is Qwen3.5-0.8B BF16, but that comparison is against a much smaller model, and the advantage fades under load
- This is a useful signal for local-inference builders: NVFP4 plus vLLM can make a 2B-class multimodal model feel much bigger than it is
- It is still a benchmark result, not a quality verdict; leaderboard rank here mostly reflects serving efficiency and workload shape
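The KV-cache point above can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares full-context KV-cache memory against a 512-token sliding window; every model dimension here (layer count, KV heads, head size, how many layers are windowed) is an illustrative assumption, not Gemma's published config.

```python
# Rough KV-cache sizing sketch: why a 512-token sliding window on most
# layers keeps memory growth in check at deep context.

def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store tokens * n_kv_heads * head_dim elements per layer,
    # hence the leading factor of 2. bytes_per_elem=2 assumes an FP16 cache.
    return 2 * tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 2B-class config: 24 layers, 4 KV heads, head_dim 128.
full = kv_cache_bytes(tokens=32_768, n_layers=24, n_kv_heads=4, head_dim=128)

# With a 512-token window on most layers, per-layer cache is capped.
# Assume 20 of 24 layers are windowed and 4 attend globally.
windowed = (kv_cache_bytes(tokens=512, n_layers=20, n_kv_heads=4, head_dim=128)
            + kv_cache_bytes(tokens=32_768, n_layers=4, n_kv_heads=4, head_dim=128))

print(f"full attention : {full / 2**20:.0f} MiB per sequence")   # ~1536 MiB
print(f"sliding window : {windowed / 2**20:.0f} MiB per sequence")  # ~276 MiB
```

At 32K context the windowed cache is roughly 5-6x smaller per sequence under these assumptions, which is exactly the kind of headroom that lets a small model sustain higher concurrency on a single box.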
// TAGS
gemma-4-e2b-it-nvfp4 · llm · multimodal · benchmark · inference · gpu · open-weights
DISCOVERED
2026-04-06
PUBLISHED
2026-04-06
RELEVANCE
9/10
AUTHOR
CoconutMario