OPEN_SOURCE
REDDIT // 1d ago // INFRASTRUCTURE
Gemma 4 26B hangs vLLM startup
Developers are reporting engine-core failures and startup hangs when serving Google's Gemma 4 26B-A4B checkpoint in vLLM, especially across multi-node Spark setups. The model is a 26B MoE with only 3.8B active parameters, so this reads more like a serving-stack compatibility issue than a raw capacity limit.
// ANALYSIS
This looks like the usual gap between a polished model launch and production-ready distributed serving. Gemma 4 is efficient on paper, but the rough edges are in the orchestration layer, not the checkpoint size.
- The reported `RayTaskError(ValueError)` points at Ray-backed startup and worker coordination, not generation quality
- vLLM's documentation lists Gemma 4 as supported, but the deployment path still has sharp edges around memory profiling and multimodal initialization
- The docs explicitly recommend disabling multimodal inputs with `--limit-mm-per-prompt image=0 audio=0` for text-only workloads, which suggests that startup memory accounting for those modalities is still expensive
- Quantization may reduce memory pressure, but it will not fix a broken distributed boot path if the failure is in profiling or engine selection
- For adopters, this is a reminder that "supported" and "boringly deployable" are not the same thing yet
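The text-only mitigation above can be sketched as a launch command. This is illustrative only: the Hugging Face model ID and the tensor-parallel setting are assumptions, not taken from the report, and the exact `--limit-mm-per-prompt` value syntax has varied across vLLM versions.

```shell
# Hypothetical text-only launch: model ID and parallelism flags are
# assumptions. Zeroing the per-prompt image/audio budgets (as the vLLM
# docs suggest for text-only workloads) skips multimodal memory
# profiling during engine startup.
vllm serve google/gemma-4-26b-a4b \
  --limit-mm-per-prompt image=0 audio=0 \
  --tensor-parallel-size 2
```

If the hang is in Ray worker coordination rather than memory profiling, trimming multimodal budgets will not help; comparing single-node startup against the multi-node path is the quicker way to isolate which layer is failing.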
// TAGS
gemma-4 · vllm · ray · llm · inference · gpu · open-source
DISCOVERED
1d ago
2026-04-10
PUBLISHED
2d ago
2026-04-10
RELEVANCE
8 / 10
AUTHOR
No_Brilliant_7649