OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE
Gemma 4 31B-it trips vLLM bug
A Reddit user reports Gemma 4 31B-it producing repetitive `thinking nvarchar(max)` output when served through vLLM with `--reasoning-parser gemma4` and `enable_thinking: false`. The behavior looks like a parser/template interaction bug in vLLM’s structured-output path, not a model-quality issue.
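A minimal sketch of the kind of request the report describes, sent against a vLLM OpenAI-compatible endpoint. The endpoint URL and the exact model id below are illustrative assumptions, not details from the report; `chat_template_kwargs` is vLLM's passthrough for template variables such as `enable_thinking`.

```python
# Sketch of the reported request shape (model id and prompt are assumptions).
# The server side would be running with `--reasoning-parser gemma4`.
import json

payload = {
    "model": "google/gemma-4-31b-it",  # assumed model id for illustration
    "messages": [
        {"role": "user", "content": "Summarize the bug report in one line."}
    ],
    # vLLM forwards chat_template_kwargs into the chat template; the report
    # pairs enable_thinking=false with the gemma4 reasoning parser.
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
# With a live server, this body would be POSTed to
# http://localhost:8000/v1/chat/completions; per the report, the completion
# then degenerates into repeated `thinking nvarchar(max)` text.
print(body[:40])
```

The payload itself is ordinary; the reported failure comes from how the server-side parser handles the disabled-thinking case.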
// ANALYSIS
The interesting part here is that the model is probably fine; the serving stack is what’s misbehaving. vLLM’s Gemma 4 reasoning parser appears to be gating structured decoding in a way that can leave the request in a broken reasoning state.
- The reported symptom matches a known class of vLLM parser bugs: reasoning/parsing state not cleanly transitioning when thinking is disabled
- If structured output or tool use is involved, this can silently degrade correctness instead of failing loudly
- The issue is operationally relevant for anyone self-hosting Gemma 4 behind OpenAI-compatible APIs
- The repeated `nvarchar(max)` text strongly suggests template contamination or token leakage from the serving layer, not a normal model completion
- For now, the safest workaround is likely to remove the reasoning parser or adjust the chat template/thinking settings until the vLLM side is fixed
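As a concrete form of that workaround, the reasoning parser can simply be dropped from the launch command. This is a sketch, not a confirmed fix; the model id is an assumption, and removing the parser means the server returns raw completions with no reasoning-field separation.

```shell
# Reported failing configuration (sketch):
#   vllm serve google/gemma-4-31b-it --reasoning-parser gemma4

# Workaround sketch: serve without the reasoning parser so no
# reasoning-state machine sits between the model and the response.
vllm serve google/gemma-4-31b-it
```

Clients that relied on a separate reasoning field would then need to handle plain completions until the parser-side bug is fixed upstream.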
// TAGS
llm · inference · reasoning · structured-output · self-hosted · vllm · gemma-4
DISCOVERED
4h ago
2026-05-06
PUBLISHED
7h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
Fr4y3R