OPEN_SOURCE ↗
REDDIT // 6d ago // BENCHMARK RESULT

Gemma-4-E2B-it NVFP4 nabs Spark wins

A community NVFP4 quantization of Google’s Gemma-4-E2B-it hits about 90 tok/s on a single DGX Spark and lands nine first-place finishes in Spark Arena-style benchmark runs. It looks especially strong at deeper context and higher concurrency, where its tiny KV footprint helps it stay competitive.

// ANALYSIS

Impressive serving efficiency, but this is a throughput story more than a quality story: the model is winning on hardware fit, quantization, and context management as much as on raw model capability.

  • Single-user decode is already strong at ~89 tok/s, while prompt processing peaks at 8,765 tok/s at 4K depth
  • The PLE architecture and 512-token sliding window on most layers keep KV cache growth in check, which is why it scales better under concurrency
  • The only model ahead on single-user token generation is Qwen3.5-0.8B BF16, but that comparison is against a much smaller model and the advantage fades under load
  • This is a useful signal for local inference builders: NVFP4 plus vLLM can make a 2B-class multimodal model feel much bigger than it is
  • It is still a benchmark result, not a quality verdict; leaderboard rank here mostly reflects serving efficiency and workload shape
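The KV-cache point above can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares per-sequence cache size for full attention versus a 512-token sliding window on most layers; the layer count, KV-head count, and head dimension are illustrative assumptions for a 2B-class model, not Gemma's actual config.

```python
# Rough KV-cache size comparison: full attention vs. a 512-token
# sliding window on most layers. All model dimensions below are
# hypothetical, chosen only to show the scaling behavior.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """Bytes of K+V cache for one sequence at a given depth."""
    cached = seq_len if window is None else min(seq_len, window)
    # factor 2 = one K tensor + one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * cached * bytes_per_elem

# Assumed config: 30 layers, 4 KV heads, head_dim 128, fp16 cache, 32K context.
full = kv_cache_bytes(32_768, 30, 4, 128)

# Suppose 25 of the 30 layers use a 512-token window and 5 stay global:
windowed = (kv_cache_bytes(32_768, 25, 4, 128, window=512)
            + kv_cache_bytes(32_768, 5, 4, 128))

print(f"full attention : {full / 2**20:.0f} MiB per sequence")      # 1920 MiB
print(f"sliding window : {windowed / 2**20:.0f} MiB per sequence")  #  345 MiB
```

Because the windowed layers cap out at 512 tokens, per-sequence cache cost grows only with the few global layers, which is why a small model like this can keep many concurrent long-context sequences resident on one Spark.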
// TAGS
gemma-4-e2b-it-nvfp4 · llm · multimodal · benchmark · inference · gpu · open-weights

DISCOVERED

2026-04-06 (6d ago)

PUBLISHED

2026-04-06 (6d ago)

RELEVANCE

9 / 10

AUTHOR

CoconutMario