REDDIT // 4d ago · BENCHMARK RESULT

Gemma 4 tanks in early hallucination tests

Just days after Google’s frontier-level Gemma 4 launch, developers on r/LocalLLaMA are reporting alarming hallucination rates in RAG benchmarks. Despite topping reasoning leaderboards, the models—particularly the 26B A4B MoE—struggle with factual grounding in complex retrieval tasks.

// ANALYSIS

Google’s pivot to "Thinking Mode" reasoning seems to have come at the expense of retrieval-augmented precision.

  • The flagship 31B Dense and 26B MoE models are excelling in synthetic reasoning benchmarks but failing practical grounding tests in RAG pipelines.
  • Users report that the E4B and E2B edge variants are particularly prone to "nonsense" responses when RAG context is missing, suggesting a lack of robust refusal mechanisms.
  • The 26B A4B variant’s MoE routing may be sacrificing factual fidelity for its 4B-class inference speed, a trade-off that is proving controversial for production deployment.
  • This feedback mirrors a broader industry trend where increased "reasoning" parameters sometimes lead to more convincing but equally incorrect hallucinations.
  • Comparison benchmarks show Gemma 4 trailing behind competitors like Grok 4.20 in reliability, a significant blow to its "intelligence-first" branding.
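The missing-context failure mode described above can be probed with a small harness: prompt the model with an empty retrieval context and measure how often it abstains instead of inventing an answer. This is a minimal sketch, not the benchmark the Reddit posters used; `call_model` is a hypothetical stand-in for whatever local inference endpoint you run, and the refusal markers are illustrative heuristics.

```python
# Hypothetical RAG refusal-rate harness (sketch, not the r/LocalLLaMA benchmark).
# Substring markers that suggest the model abstained rather than hallucinated.
REFUSAL_MARKERS = ("i don't know", "not in the context", "cannot answer")


def is_grounded_refusal(answer: str) -> bool:
    """Heuristic: does the answer abstain instead of inventing facts?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(call_model, questions, context: str = "") -> float:
    """Fraction of answers that correctly abstain for the given (here empty) context.

    `call_model` is any callable mapping a prompt string to a response string,
    e.g. a wrapper around a local Gemma endpoint.
    """
    hits = 0
    for question in questions:
        prompt = (
            f"Context:\n{context}\n\n"
            f"Question: {question}\n"
            "Answer using only the context above."
        )
        if is_grounded_refusal(call_model(prompt)):
            hits += 1
    return hits / len(questions)


if __name__ == "__main__":
    # Stub model that always hallucinates, to demo the harness end to end.
    bad_model = lambda prompt: "The capital of Atlantis is Poseidonia."
    print(refusal_rate(bad_model, ["What is the capital of Atlantis?"]))  # prints 0.0
```

A score near 1.0 means the model reliably refuses when retrieval returns nothing; the complaints above amount to the edge variants scoring near 0.0 on checks of this shape.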
// TAGS
gemma-4 · google-deepmind · llm · rag · benchmark · hallucination · open-weights

DISCOVERED

2026-04-07

PUBLISHED

2026-04-07

RELEVANCE

10/10

AUTHOR

Fusseldieb