OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE

Gemma 4 31B-it trips vLLM bug

A Reddit user reports Gemma 4 31B-it producing repetitive `thinking nvarchar(max)` output when served through vLLM with `--reasoning-parser gemma4` and `enable_thinking: false`. The behavior looks like a parser/template interaction bug in vLLM’s structured-output path, not a model-quality issue.
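The reported configuration can be sketched as an OpenAI-compatible chat request. This is a minimal illustration, assuming vLLM's `chat_template_kwargs` passthrough is how `enable_thinking: false` reaches the template; the model name and prompt are placeholders, not from the report.

```python
import json

# Request body as it would be POSTed to a vLLM OpenAI-compatible
# /v1/chat/completions endpoint. The server is assumed to have been
# launched with `--reasoning-parser gemma4` (flag as quoted in the report).
payload = {
    "model": "gemma-4-31b-it",  # illustrative served-model name
    "messages": [
        {"role": "user", "content": "Summarize the bug report in one sentence."}
    ],
    # vLLM forwards chat_template_kwargs into the chat template;
    # the report disables thinking this way.
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
```

With this setup, the reported failure mode is that the completion degenerates into repeated `thinking nvarchar(max)` text instead of a normal answer.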

// ANALYSIS

The interesting part here is that the model is probably fine; the serving stack is what’s misbehaving. vLLM’s Gemma 4 reasoning parser appears to be gating structured decoding in a way that can leave the request in a broken reasoning state.

  • The reported symptom matches a known class of vLLM parser bugs: reasoning/parsing state not cleanly transitioning when thinking is disabled
  • If structured output or tool use is involved, this can silently degrade correctness instead of failing loudly
  • The issue is operationally relevant for anyone self-hosting Gemma 4 behind OpenAI-compatible APIs
  • The repeated `nvarchar(max)` text strongly suggests template contamination or token leakage from the serving layer, not a normal model completion
  • For now, the safest workaround is likely to remove the reasoning parser or adjust the chat template/thinking settings until the vLLM side is fixed
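Until a fix lands, affected deployments can at least detect the failure mode server-side. The heuristic below is our own sketch, not part of vLLM: it flags completions that repeat a serving-layer artifact such as the reported `nvarchar(max)` token, so degraded responses fail loudly instead of silently.

```python
def looks_like_leak(text: str, marker: str = "nvarchar(max)", threshold: int = 3) -> bool:
    """Heuristic guard for the reported symptom: a completion that
    repeats a template/parser artifact is almost certainly broken.

    `marker` and `threshold` are illustrative defaults, tuned to the
    repetitive `thinking nvarchar(max)` output described in the report.
    """
    return text.count(marker) >= threshold


# Example: a degenerate completion trips the check, a normal one does not.
print(looks_like_leak("thinking nvarchar(max) " * 5))   # degenerate output
print(looks_like_leak("HTTP/1.1 is defined in RFC 2616."))
```

A wrapper like this can sit in the client or gateway and retry with the reasoning parser disabled when the check fires.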
// TAGS
llm · inference · reasoning · structured-output · self-hosted · vllm · gemma-4

DISCOVERED

4h ago

2026-05-06

PUBLISHED

7h ago

2026-05-06

RELEVANCE

8 / 10

AUTHOR

Fr4y3R