Gemma 4 31B loops in llama.cpp
OPEN_SOURCE
REDDIT // 4d ago · MODEL RELEASE

Users are reporting infinite looping issues with Google's new Gemma 4 31B model in recent llama.cpp builds. The bug stems from a combination of incorrect newline tokenization and misclassified control tokens that disrupt the model's new reasoning and tool-calling architecture.

// ANALYSIS

Gemma 4's specialized "thinking" mode is exposing deep-seated tokenizer assumptions in local inference engines.

  • A critical bug in llama.cpp was splitting the double newline ("\n\n") into two separate single-newline tokens, causing the model to lose coherence in long-form reasoning sessions.
  • Specialized tool-call tokens were incorrectly classified as user-defined instead of control tokens, preventing the parser from identifying reasoning boundaries.
  • Users on build b8693 should verify their GGUF files; those exported before April 4, 2026, lack essential tokenizer fixes regardless of the runtime version.
  • Stability requires the `--jinja` flag to correctly process the model's internal templates and setting `--min-p 0.0` to avoid interference with Gemma 4's sampling logic.
  • A new "Unified KV Cache" update has significantly improved VRAM efficiency, but users must update to the latest master branch to benefit from the reduced memory footprint.
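The double-newline bug described above comes down to the pre-tokenizer emitting two single-`\n` tokens where the model expects one fused `\n\n` token. A minimal sketch of the difference, using hypothetical token IDs (this is an illustration of the failure class, not llama.cpp's actual tokenizer code):

```python
# Hypothetical vocabulary: the model has a dedicated fused "\n\n" token.
# The IDs 108/109 are illustrative, not Gemma's real vocabulary.
VOCAB = {"\n": 108, "\n\n": 109}

def tokenize_buggy(text: str) -> list[int]:
    """Char-by-char pass: always splits "\n\n" into two single-newline tokens."""
    return [VOCAB["\n"] for ch in text if ch == "\n"]

def tokenize_fixed(text: str) -> list[int]:
    """Longest-match pass: prefers the fused "\n\n" token when it applies."""
    out, i = [], 0
    while i < len(text):
        if text.startswith("\n\n", i):
            out.append(VOCAB["\n\n"])
            i += 2
        elif text[i] == "\n":
            out.append(VOCAB["\n"])
            i += 1
        else:
            i += 1  # non-newline characters elided for brevity
    return out

print(tokenize_buggy("a\n\nb"))  # [108, 108] -- the fused token never appears
print(tokenize_fixed("a\n\nb"))  # [109] -- matches what the model was trained on
```

Because the model was trained on sequences containing the fused token, a stream of single-newline tokens in its place is out-of-distribution input, which is consistent with the degradation users see in long reasoning sessions.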
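Putting the flag recommendations above together, a launch command might look like the following. The model filename and quantization are placeholders; only the `--jinja` and `--min-p` flags come from the thread:

```shell
# Placeholder model path -- use a GGUF exported after 2026-04-04.
# --jinja     : process the model's embedded chat template
# --min-p 0.0 : disable min-p so it doesn't interfere with Gemma 4's sampling
./llama-cli -m ./gemma-4-31b-Q4_K_M.gguf --jinja --min-p 0.0 -p "Hello"
```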
// TAGS
gemma-4-31b · llama-cpp · llm · reasoning · open-weights · self-hosted

DISCOVERED

2026-04-08 (4d ago)

PUBLISHED

2026-04-07 (4d ago)

RELEVANCE

9/10

AUTHOR

Express_Quail_1493