Heretic Slashes Gemma 4 Refusals, Still Tough
REDDIT // 5d ago // BENCHMARK RESULT

Heretic’s automated ablation run on Gemma 4 E4B-it, performed within two days of the model's release, cut refusals from roughly 98% to 47.5% across 200 trials. That is a meaningful reduction, but Gemma 4 proved far more resistant than Gemma 3, where Heretic got refusals down to 3%.
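
To put the reported figures side by side, here is a minimal arithmetic sketch. It uses only the three percentages quoted above; the "roughly 98%" baseline is treated as exactly 0.98 for illustration, and the variable names are made up for this example.

```python
def relative_reduction(before, after):
    """Fraction of the original refusal rate that was removed."""
    return (before - after) / before


# Figures quoted in the summary (the baseline is approximate in the source).
gemma4_before = 0.98
gemma4_after = 0.475
gemma3_after = 0.03

gemma4_relative = relative_reduction(gemma4_before, gemma4_after)
points_dropped = (gemma4_before - gemma4_after) * 100
residual_ratio = gemma4_after / gemma3_after

print(f"Gemma 4 relative reduction: {gemma4_relative:.1%}")
print(f"Gemma 4 absolute drop: {points_dropped:.1f} points")
print(f"Gemma 4 residual refusals vs Gemma 3: {residual_ratio:.1f}x")
```

On these numbers Heretic removed about half of Gemma 4's refusals, while the refusal rate that remains is roughly sixteen times the rate reported for Gemma 3.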

// ANALYSIS

This looks less like a clean victory than a limit test for directional ablation: Heretic can punch through broad refusal behavior, but Gemma 4 appears to preserve deeper safety structure that survives the first pass.

  • The model seems to lose topic-level refusal gates more easily than prompt-specific or operational refusals, which suggests the safety signal is distributed rather than concentrated in one linear direction.
  • The Gemma 3 vs Gemma 4 gap is the real story: newer models may have more layered alignment machinery, so a single ablation pipeline scales worse as models get more sophisticated.
  • The patching work matters as much as the result; when a model needs Transformers v5, tokenizer fixes, and PEFT compatibility shims just to run, reproducibility is part of the achievement.
  • The unchanged dual-use technical prompts show this is not a blanket capability rewrite, but a selective reduction in refusal behavior with uneven effects across prompt classes.
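
The point about a distributed safety signal can be illustrated with a toy sketch of directional ablation. This is not Heretic's implementation and it does not model Gemma 4. It assumes synthetic 8-dimensional "activations", a difference-of-means refusal direction, and two hypothetical prompt classes whose refusal signal lies on different axes. Every name and number in it is an assumption made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful, harmless):
    """Unit-length difference between the two mean activations."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    length = norm(diff)
    return [x / length for x in diff]


def ablate(vec, direction):
    """Subtract the component of vec that lies along the unit vector direction."""
    scale = dot(vec, direction)
    return [v - scale * d for v, d in zip(vec, direction)]


def sample(shift_axis=None, shift=3.0, dim=8):
    """Hypothetical activation: Gaussian noise plus an optional shift on one axis."""
    vec = [random.gauss(0.0, 0.1) for _ in range(dim)]
    if shift_axis is not None:
        vec[shift_axis] += shift
    return vec


random.seed(0)
harmless = [sample() for _ in range(50)]
topic_level = [sample(shift_axis=0) for _ in range(40)]  # assumed majority class
operational = [sample(shift_axis=1) for _ in range(10)]  # assumed minority class

# One direction estimated from all harmful prompts pooled together.
direction = refusal_direction(topic_level + operational, harmless)
base = mean_vector(harmless)


def surviving_fraction(group):
    """Share of a group's mean shift away from harmless that remains after ablation."""
    shift_vec = [g - b for g, b in zip(mean_vector(group), base)]
    return norm(ablate(shift_vec, direction)) / norm(shift_vec)


surviving_topic = surviving_fraction(topic_level)
surviving_operational = surviving_fraction(operational)
print(f"topic-level shift surviving: {surviving_topic:.2f}")
print(f"operational shift surviving: {surviving_operational:.2f}")
```

Because the pooled direction is dominated by the majority class, ablating it removes most of the topic-level shift and leaves most of the operational shift in place. In this toy setting, that is the kind of uneven effect across prompt classes that a single linear direction would produce if the refusal signal were spread over more than one axis.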
// TAGS
heretic · gemma-4 · llm · benchmark · open-source · research

DISCOVERED: 5d ago (2026-04-06)
PUBLISHED: 6d ago (2026-04-06)
RELEVANCE: 8/10
AUTHOR: mattezell