Heretic Slashes Gemma 4 Refusals, Still Tough

// 51d ago · BENCHMARK RESULT

Heretic’s automated ablation run on Gemma 4 E4B-it, performed within two days of release, cut refusals from roughly 98% to 47.5% across 200 trials. That is a meaningful reduction, but Gemma 4 proved far more resistant than Gemma 3, where Heretic got refusals down to 3%.

// ANALYSIS

This looks less like a clean victory than a limit test for directional ablation: Heretic can punch through broad refusal behavior, but Gemma 4 appears to preserve deeper safety structure that survives the first pass.

  • The model seems to lose topic-level refusal gates more easily than prompt-specific or operational refusals, which suggests the safety signal is distributed rather than concentrated in one linear direction.
  • The Gemma 3 vs Gemma 4 gap is the real story: newer models may have more layered alignment machinery, so a single ablation pipeline scales worse as models get more sophisticated.
  • The patching work matters as much as the result: if a model needs Transformers v5, tokenizer fixes, and PEFT compatibility shims, reproducibility is part of the achievement.
  • The unchanged dual-use technical prompts show this is not a blanket capability rewrite, but a selective reduction in refusal behavior with uneven effects across prompt classes.
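
For readers unfamiliar with the technique, the core operation behind directional ablation can be sketched in a few lines. This is a toy illustration on plain Python lists, not Heretic's code: the function names are hypothetical, and estimating the direction as a difference of mean activations is an assumption about how such a direction is commonly obtained.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def mean_difference_direction(refused_acts, complied_acts):
    """Hypothetical helper: estimate a 'refusal direction' as the
    normalized difference between mean activations on refused and
    complied prompts (an assumed estimation method)."""
    dim = len(refused_acts[0])
    mean_refused = [sum(a[i] for a in refused_acts) / len(refused_acts) for i in range(dim)]
    mean_complied = [sum(a[i] for a in complied_acts) / len(complied_acts) for i in range(dim)]
    return unit([r - c for r, c in zip(mean_refused, mean_complied)])


def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`:
    h' = h - (h . r) r, with r normalized to unit length."""
    r = unit(direction)
    projection = sum(a * b for a, b in zip(activation, r))
    return [a - projection * b for a, b in zip(activation, r)]


# Toy 2-D example: the first axis plays the role of the refusal direction.
direction = mean_difference_direction(
    refused_acts=[[2.0, 0.0], [4.0, 0.0]],
    complied_acts=[[0.0, 0.0], [0.0, 0.0]],
)
ablated = ablate([3.0, 4.0], direction)
```

In this toy case only the component along the estimated direction is removed and the second axis is left untouched. If refusal behavior is spread across several directions, as the first bullet suggests, removing one direction leaves the others in place, which would be consistent with a partial drop in refusals rather than a near-total one.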
// TAGS
heretic, gemma-4, llm, benchmark, open-source, research

DISCOVERED

51d ago

2026-04-06

PUBLISHED

51d ago

2026-04-06

RELEVANCE

8/10

AUTHOR

mattezell