OPEN_SOURCE
REDDIT · 5d ago · BENCHMARK RESULT
Heretic Slashes Gemma 4 Refusals, Still Tough
Heretic’s automated ablation run on Gemma 4 E4B-it, performed within two days of release, cut refusals from roughly 98% to 47.5% across 200 trials. That is a meaningful reduction, but Gemma 4 proved far more resistant than Gemma 3, where Heretic got refusals down to 3%.
// ANALYSIS
This looks less like a clean victory than a limit test for directional ablation: Heretic can punch through broad refusal behavior, but Gemma 4 appears to preserve deeper safety structure that survives the first pass.
- The model seems to lose topic-level refusal gates more easily than prompt-specific or operational refusals, which suggests the safety signal is distributed rather than concentrated in one linear direction.
- The Gemma 3 vs Gemma 4 gap is the real story: newer models may have more layered alignment machinery, so a single ablation pipeline becomes less effective as models get more sophisticated.
- The patching work matters as much as the result; when a model needs Transformers v5, tokenizer fixes, and PEFT compatibility shims just to run, reproducibility is part of the achievement.
- The unchanged dual-use technical prompts show this is not a blanket capability rewrite, but a selective reduction in refusal behavior with uneven effects across prompt classes.
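For readers unfamiliar with the technique, the idea of removing "one linear direction" can be sketched in a few lines. This is a generic illustration of directional ablation on synthetic data, not Heretic's actual code: the function names, the difference-of-means estimate, the `scale` parameter, and the toy activations are all assumptions made for the example.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a candidate direction as the normalized difference of mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(weight, direction, scale=1.0):
    """Remove the component along `direction` from everything `weight` outputs.

    weight: (d_model, d_in) matrix that writes into the residual stream.
    scale=1.0 projects the direction out completely; smaller values only dampen it.
    """
    projector = np.outer(direction, direction)
    return weight - scale * (projector @ weight)

# Synthetic stand-ins for real activations (assumption: one planted direction).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_model))
W_ablated = ablate(W, r)

# After full ablation, the layer can no longer write along r.
x = rng.normal(size=d_model)
before = abs(r @ (W @ x))
after = abs(r @ (W_ablated @ x))
print(f"component along r before: {before:.4f}, after: {after:.2e}")
```

In this toy setup a single projection removes the planted signal entirely. If refusal behavior is instead spread across several directions or layers, as the bullets above suggest for Gemma 4, projecting out one direction would leave the rest intact, which is one plausible reading of the residual refusals.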
// TAGS
heretic · gemma-4 · llm · benchmark · open-source · research
DISCOVERED
5d ago
2026-04-06
PUBLISHED
6d ago
2026-04-06
RELEVANCE
8/ 10
AUTHOR
mattezell