OPEN_SOURCE
REDDIT // 4h ago · NEWS
Llama.cpp samplers fail on Gemma 4 architecture
Users report that llama.cpp samplers are essentially ignored by the new Gemma 4 models, leading to repetitive, deterministic outputs even at extreme temperatures. The regression is linked to missing logit soft-capping support and recent architectural changes in backend sampling.
// ANALYSIS
The "no variance" bug in Gemma 4 isn't just a settings issue; it's an architectural mismatch between legacy samplers and new logit dynamics.
- Gemma 4's logit soft-capping requires specialized handling that was only recently stabilized in PR #21390.
- The migration of sampling logic directly into the CUDA computation graph has introduced state synchronization errors for floating-point parameters.
- Users see coherent output even at temperature 1000 because the sampler falls back to greedy decoding when it encounters invalid logit states.
- Re-downloading GGUFs is mandatory for many users, as early April 2026 quants lacked the `add_bos` flag and the metadata needed for Gemma 4's specific reasoning budget.
- The community is pivoting toward "Min P" and the new "Reasoning Budget Sampler" as the only reliable way to maintain quality on these newer architectures.
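To make the two mechanisms above concrete, here is a minimal Python sketch of (a) Gemma-style logit soft-capping and (b) Min-P filtering. The cap value of 30.0 is an assumption borrowed from earlier Gemma releases, not a confirmed Gemma 4 constant, and the function names are illustrative rather than llama.cpp's actual API:

```python
import math

def soft_cap(logits, cap=30.0):
    # Gemma-style soft-capping squashes every logit into (-cap, cap)
    # via cap * tanh(logit / cap). Samplers tuned for raw logits can
    # misbehave on these compressed values, which is one reason legacy
    # sampling paths need specialized handling for this family.
    return [cap * math.tanh(x / cap) for x in logits]

def min_p_filter(logits, min_p=0.05, temperature=1.0):
    # Min-P sampling: after temperature scaling and softmax, keep only
    # tokens whose probability is at least min_p times the probability
    # of the single most likely token. The threshold adapts to how
    # peaked the distribution is, which is why it stays usable where
    # fixed top-k / top-p cutoffs break down.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for stability
    probs = [math.exp(x - m) for x in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    threshold = min_p * max(probs)
    return [(i, p) for i, p in enumerate(probs) if p >= threshold]

# Example: a very large raw logit (120.0) is compressed to just under 30,
# so the gap to its competitors shrinks dramatically after capping.
capped = soft_cap([120.0, 80.0, 10.0])
kept = min_p_filter([2.0, 1.0, -5.0], min_p=0.1)
```

This is only a didactic sketch of the sampling math, not a description of how llama.cpp's CUDA-graph sampler actually implements either step.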
// TAGS
llama-cpp · gemma-4 · inference · sampling · open-source · llm
DISCOVERED
4h ago
2026-04-19
PUBLISHED
6h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
kaisurniwurer