OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE

Gemma 4 31B-it trips vLLM bug

A Reddit user reports Gemma 4 31B-it producing repetitive `thinking nvarchar(max)` output when served through vLLM with `--reasoning-parser gemma4` and `enable_thinking: false`. The behavior looks like a parser/template interaction bug in vLLM’s structured-output path, not a model-quality issue.
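The reported configuration can be sketched as an OpenAI-compatible chat request. This is a minimal illustration, assuming vLLM's `chat_template_kwargs` passthrough is how `enable_thinking: false` reaches the template; the model name and prompt are placeholders, not from the report.

```python
import json

# Request body as it would be POSTed to a vLLM OpenAI-compatible
# /v1/chat/completions endpoint. The server is assumed to have been
# launched with `--reasoning-parser gemma4` (flag as quoted in the report).
payload = {
    "model": "gemma-4-31b-it",  # illustrative served-model name
    "messages": [
        {"role": "user", "content": "Summarize the bug report in one sentence."}
    ],
    # vLLM forwards chat_template_kwargs into the chat template;
    # the report disables thinking this way.
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
```

With this setup, the reported failure mode is that the completion degenerates into repeated `thinking nvarchar(max)` text instead of a normal answer.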

// ANALYSIS

The interesting part here is that the model is probably fine; the serving stack is what’s misbehaving. vLLM’s Gemma 4 reasoning parser appears to be gating structured decoding in a way that can leave the request in a broken reasoning state.

  • The reported symptom matches a known class of vLLM parser bugs: reasoning/parsing state not cleanly transitioning when thinking is disabled
  • If structured output or tool use is involved, this can silently degrade correctness instead of failing loudly
  • The issue is operationally relevant for anyone self-hosting Gemma 4 behind OpenAI-compatible APIs
  • The repeated `nvarchar(max)` text strongly suggests template contamination or token leakage from the serving layer, not a normal model completion
  • For now, the safest workaround is likely to remove the reasoning parser or adjust the chat template/thinking settings until the vLLM side is fixed
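Until a fix lands, affected deployments can at least detect the failure mode server-side. The heuristic below is our own sketch, not part of vLLM: it flags completions that repeat a serving-layer artifact such as the reported `nvarchar(max)` token, so degraded responses fail loudly instead of silently.

```python
def looks_like_leak(text: str, marker: str = "nvarchar(max)", threshold: int = 3) -> bool:
    """Heuristic guard for the reported symptom: a completion that
    repeats a template/parser artifact is almost certainly broken.

    `marker` and `threshold` are illustrative defaults, tuned to the
    repetitive `thinking nvarchar(max)` output described in the report.
    """
    return text.count(marker) >= threshold


# Example: a degenerate completion trips the check, a normal one does not.
print(looks_like_leak("thinking nvarchar(max) " * 5))   # degenerate output
print(looks_like_leak("HTTP/1.1 is defined in RFC 2616."))
```

A wrapper like this can sit in the client or gateway and retry with the reasoning parser disabled when the check fires.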
// TAGS
llm · inference · reasoning · structured-output · self-hosted · vllm · gemma-4

DISCOVERED

4h ago

2026-05-06

PUBLISHED

7h ago

2026-05-06

RELEVANCE

8 / 10

AUTHOR

Fr4y3R