OPEN_SOURCE ↗
REDDIT · 4d ago · BENCHMARK RESULT
Gemma 4 tanks in early hallucination tests
Just days after Google’s frontier-level Gemma 4 launch, developers on r/LocalLLaMA are reporting alarming hallucination rates in RAG benchmarks. Despite topping reasoning leaderboards, the models—particularly the 26B A4B MoE—struggle with factual grounding in complex retrieval tasks.
// ANALYSIS
Google’s pivot to "Thinking Mode" reasoning seems to have come at the expense of retrieval-augmented precision.
- The flagship 31B Dense and 26B MoE models are excelling in synthetic reasoning benchmarks but failing practical grounding tests in RAG pipelines.
- Users report that the E4B and E2B edge variants are particularly prone to "nonsense" responses when RAG context is missing, suggesting a lack of robust refusal mechanisms.
- The 26B A4B variant’s MoE routing may be sacrificing factual fidelity for its 4B-class inference speed, a trade-off that is proving controversial for production deployment.
- This feedback mirrors a broader industry trend where increased "reasoning" parameters sometimes lead to more convincing but equally incorrect hallucinations.
- Comparison benchmarks show Gemma 4 trailing behind competitors like Grok 4.20 in reliability, a significant blow to its "intelligence-first" branding.
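The missing "refusal mechanism" the reports describe can be approximated at the pipeline level rather than in the model. A minimal sketch, assuming a retriever that returns (text, similarity score) pairs; every name here is hypothetical and the model call is a placeholder, not any real Gemma or RAG API:

```python
# Illustrative grounding guard for a RAG pipeline: refuse instead of
# generating when retrieval yields no usable context. All identifiers
# are hypothetical stand-ins, not a real library's API.

REFUSAL = "I can't answer that from the provided documents."

def grounded_answer(question, retrieved_chunks, min_score=0.35):
    """Build a grounded prompt, or refuse if no chunk clears the threshold.

    retrieved_chunks: list of (text, similarity_score) pairs from a retriever.
    """
    usable = [text for text, score in retrieved_chunks if score >= min_score]
    if not usable:
        # No grounded context: refuse rather than let the model free-associate.
        return REFUSAL
    context = "\n\n".join(usable)
    prompt = f"Answer ONLY from this context:\n{context}\n\nQ: {question}"
    # A real pipeline would return llm.generate(prompt) here.
    return prompt

# Empty or low-scoring retrieval triggers a refusal:
print(grounded_answer("Who won in 2031?", [("irrelevant text", 0.1)]))
```

The point of the sketch is that a hard gate outside the model sidesteps the reported failure mode: even a model with weak internal refusal behavior never sees an ungrounded question.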
// TAGS
gemma-4 · google-deepmind · llm · rag · benchmark · hallucination · open-weights
DISCOVERED
2026-04-07
PUBLISHED
2026-04-07
RELEVANCE
10/10
AUTHOR
Fusseldieb