Heretic Slashes Gemma 4 Refusals, Still Tough

// 51d ago BENCHMARK RESULT

Heretic’s automated ablation run on Gemma 4 E4B-it, performed within two days of the model's release, cut refusals from roughly 98% to 47.5% across 200 trials. That is a meaningful reduction, but Gemma 4 proved far more resistant than Gemma 3, where Heretic got refusals down to 3%.

// ANALYSIS

This looks less like a clean victory than a limit test for directional ablation: Heretic can punch through broad refusal behavior, but Gemma 4 appears to preserve deeper safety structure that survives the first pass.

– The model seems to lose topic-level refusal gates more easily than prompt-specific or operational refusals, which suggests the safety signal is distributed rather than concentrated in one linear direction.
– The Gemma 3 vs Gemma 4 gap is the real story: newer models may have more layered alignment machinery, so a single ablation pipeline becomes less effective as models get more sophisticated.
– The patching work matters as much as the result: if a model needs Transformers v5, tokenizer fixes, and PEFT compatibility shims, reproducibility is part of the achievement.
– The unchanged dual-use technical prompts show this is not a blanket capability rewrite, but a selective reduction in refusal behavior with uneven effects across prompt classes.
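
For readers unfamiliar with the term, directional ablation in its generic form can be sketched as follows. This is a minimal illustration on synthetic data, not Heretic's actual pipeline: the function names, the toy activations, and the single planted "refusal" axis are all assumptions made for the example. It estimates a direction as the difference of mean activations between two prompt sets, then removes that direction from a weight matrix's output.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (hypothetical helper)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(weight, direction, scale=1.0):
    """Remove the component along `direction` from the output of `weight`.

    Computes W' = W - scale * r r^T W for a unit vector r, so that with
    scale=1.0 the output W' @ x has no component along r.
    """
    r = direction.reshape(-1, 1)
    return weight - scale * (r @ (r.T @ weight))


# Synthetic stand-ins for residual-stream activations (assumption: toy data).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0  # pretend axis 0 carries the refusal signal

harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + 4.0 * planted

r = refusal_direction(harmful, harmless)

# A random matrix standing in for a layer that writes into the residual stream.
W = rng.normal(size=(d_model, d_model))
W_ablated = ablate_direction(W, r)

# After ablation, the layer's output has (numerically) no component along r.
x = rng.normal(size=d_model)
residual = float(r @ (W_ablated @ x))
```

The sketch removes exactly one linear direction. If refusal behavior depends on structure that a single direction does not capture, as the analysis above suggests for Gemma 4, this kind of edit would be expected to leave part of that behavior intact.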

// TAGS

heretic gemma-4 llm benchmark open-source research

DISCOVERED

51d ago

2026-04-06

PUBLISHED

51d ago

2026-04-06

RELEVANCE

8/10

AUTHOR

mattezell
