OPEN_SOURCE · INFRASTRUCTURE
REDDIT · 9d ago

Modular MAX lands Gemma 4, beats vLLM

Modular says it had Gemma 4 running on its MAX inference stack on launch day across NVIDIA B200 and AMD MI355X, using the same serving layer for both vendors. On B200, it reports 15% higher output throughput than vLLM, while Gemma 4 itself brings 256K context, native multimodality, and open Apache 2.0 weights.

// ANALYSIS

The interesting part here is less the model release than the infrastructure story: Modular is positioning MAX as the portable serving layer for heterogeneous datacenter fleets, not a one-off benchmark harness.

  • Day-zero support for both NVIDIA Blackwell and AMD Instinct hardware is the real differentiator for teams that do not want to maintain separate stacks per vendor
  • The 15% win over vLLM is credible only if the methodology is disclosed; decode mix, batching, quantization, and context length can each move throughput materially (see the measurement sketch after this list)
  • Gemma 4’s 256K context and multimodal inputs raise serving complexity, so a unified inference stack matters more than raw model compatibility
  • Apache 2.0 licensing makes Gemma 4 easier to adopt in private and commercial deployments, which helps infrastructure vendors like Modular sell the portability story
  • This reads as a platform proof point for MAX: open models, OpenAI-compatible serving, and GPU-agnostic deployment in one stack
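
To make the methodology caveat concrete, here is a minimal sketch of a single-request throughput measurement against an OpenAI-compatible endpoint like the one MAX exposes. The base URL, API key, model id, and prompt are placeholder assumptions for illustration, not Modular's published benchmark configuration:

```python
import time

from openai import OpenAI

# Placeholder endpoint and credentials: MAX serves an OpenAI-compatible
# API, so the same client code targets it regardless of whether the
# backend GPU is an NVIDIA B200 or an AMD MI355X. The URL, key, and
# model id below are assumptions, not Modular's benchmark setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def output_throughput(prompt: str, max_tokens: int = 512) -> float:
    """Output tokens per second for one unbatched request."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gemma-4",  # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return resp.usage.completion_tokens / elapsed

print(f"{output_throughput('Explain KV caching in one paragraph.'):.1f} tok/s")
```

Even a toy harness like this surfaces the knobs that move the headline number: max_tokens fixes the decode length, the prompt fixes the prefill cost, and a single unbatched request says nothing about the batched steady-state throughput that vendor comparisons usually quote.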
// TAGS
max · gemma-4 · inference · gpu · multimodal · benchmark · open-source

DISCOVERED: 2026-04-02 (9d ago)

PUBLISHED: 2026-04-02 (9d ago)

RELEVANCE: 9/10

AUTHOR: carolinedfrasca