OPEN_SOURCE ↗
REDDIT · 4d ago · BENCHMARK RESULT
Gemma 4 tanks in early hallucination tests
Just days after Google’s frontier-level Gemma 4 launch, developers on r/LocalLLaMA are reporting alarming hallucination rates in RAG benchmarks. Despite topping reasoning leaderboards, the models—particularly the 26B A4B MoE—struggle with factual grounding in complex retrieval tasks.
// ANALYSIS
Google’s pivot to "Thinking Mode" reasoning seems to have come at the expense of retrieval-augmented precision.
- The flagship 31B Dense and 26B MoE models are excelling in synthetic reasoning benchmarks but failing practical grounding tests in RAG pipelines.
- Users report that the E4B and E2B edge variants are particularly prone to "nonsense" responses when RAG context is missing, suggesting a lack of robust refusal mechanisms.
- The 26B A4B variant’s MoE routing may be sacrificing factual fidelity for its 4B-class inference speed, a trade-off that is proving controversial for production deployment.
- This feedback mirrors a broader industry trend where increased "reasoning" parameters sometimes lead to more convincing but equally incorrect hallucinations.
- Comparison benchmarks show Gemma 4 trailing behind competitors like Grok 4.20 in reliability, a significant blow to its "intelligence-first" branding.
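The missing "refusal mechanism" the reports describe can be approximated at the pipeline level rather than in the model. A minimal sketch, assuming a retriever that returns (text, similarity score) pairs; every name here is hypothetical and the model call is a placeholder, not any real Gemma or RAG API:

```python
# Illustrative grounding guard for a RAG pipeline: refuse instead of
# generating when retrieval yields no usable context. All identifiers
# are hypothetical stand-ins, not a real library's API.

REFUSAL = "I can't answer that from the provided documents."

def grounded_answer(question, retrieved_chunks, min_score=0.35):
    """Build a grounded prompt, or refuse if no chunk clears the threshold.

    retrieved_chunks: list of (text, similarity_score) pairs from a retriever.
    """
    usable = [text for text, score in retrieved_chunks if score >= min_score]
    if not usable:
        # No grounded context: refuse rather than let the model free-associate.
        return REFUSAL
    context = "\n\n".join(usable)
    prompt = f"Answer ONLY from this context:\n{context}\n\nQ: {question}"
    # A real pipeline would return llm.generate(prompt) here.
    return prompt

# Empty or low-scoring retrieval triggers a refusal:
print(grounded_answer("Who won in 2031?", [("irrelevant text", 0.1)]))
```

The point of the sketch is that a hard gate outside the model sidesteps the reported failure mode: even a model with weak internal refusal behavior never sees an ungrounded question.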
// TAGS
gemma-4 · google-deepmind · llm · rag · benchmark · hallucination · open-weights
DISCOVERED
2026-04-07
PUBLISHED
2026-04-07
RELEVANCE
10/10
AUTHOR
Fusseldieb