OPEN_SOURCE
REDDIT // MODEL RELEASE · 4d ago
Gemma 4 31B loops in llama.cpp
Users are reporting infinite looping issues with Google's new Gemma 4 31B model in recent llama.cpp builds. The bug stems from a combination of incorrect newline tokenization and misclassified control tokens that disrupt the model's new reasoning and tool-calling architecture.
// ANALYSIS
Gemma 4's specialized "thinking" mode is exposing deep-seated tokenizer assumptions in local inference engines.
- A critical bug in llama.cpp was splitting double newlines into separate tokens, causing the model to lose coherence in long-form reasoning sessions.
- Specialized tool-call tokens were incorrectly classified as user-defined instead of control tokens, preventing the parser from identifying reasoning boundaries.
- Users on build b8693 should verify their GGUF files; files exported before April 4, 2026, lack the tokenizer fixes regardless of the runtime version.
- Stability requires the `--jinja` flag, so the model's embedded chat template is processed correctly, and `--min-p 0.0`, so min-p sampling does not interfere with Gemma 4's sampling logic.
- A new "Unified KV Cache" update has significantly improved VRAM efficiency, but users must update to the latest master branch to benefit from the reduced memory footprint.
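The flag recommendations above can be sketched as a single launch command. This is a minimal illustration, not a verified recipe: the GGUF filename and context size are placeholders, and only `--jinja` and `--min-p` come from the reports themselves.

```shell
# Illustrative llama-server launch for Gemma 4 31B.
# Assumes a current master build and a GGUF exported after 2026-04-04;
# the model filename and -c value are hypothetical.
./llama-server \
  -m ./gemma-4-31b.gguf \
  --jinja \
  --min-p 0.0 \
  -c 8192
```

If loops persist with these flags, re-exporting the GGUF with up-to-date conversion scripts is the other lever the thread points at, since pre-fix files carry the broken tokenizer metadata with them.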
// TAGS
gemma-4-31b · llama-cpp · llm · reasoning · open-weights · self-hosted
DISCOVERED
2026-04-08
PUBLISHED
2026-04-07
RELEVANCE
9/10
AUTHOR
Express_Quail_1493