OPEN_SOURCE
HN · HACKER_NEWS // RESEARCH PAPER · 4h ago
"Uncensored" AI models still exhibit persistent safety bias
New research into the "flinch" phenomenon shows that AI models marketed as "uncensored" still quietly nudge outputs away from controversial or charged language. Using its "flinch score" metric, Morgin.ai argues that refusal ablation removes visible refusals but leaves the underlying word-level bias intact.
// ANALYSIS
The "flinch" metric proves that truly neutral AI is currently an illusion, as safety guardrails are baked into the probability distributions of base models long before any "uncensoring" fine-tuning occurs.
- –"Refusal ablation" only removes the visible "I can't help with that" response, leaving the underlying distributional bias inherited from pretraining intact.
- –Transparently trained models like OLMo and Pythia set the baseline for the least amount of "flinching," while commercial models like Gemma show significantly higher scores.
- –The research shows that "abliterating" a model can paradoxically increase its flinch score, suggesting safety layers and core distributions are deeply intertwined.
- –This invisible filtering provides a mechanism for subtle information control that is much harder for users to detect or bypass than a standard refusal.
- –For developers, this research highlights that "uncensored" models are not a silver bullet for achieving neutral or highly creative outputs in sensitive domains.
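A minimal sketch of how a word-level flinch probe could work, assuming the score compares the mean per-token log-probability a model assigns to a charged phrasing against a neutral paraphrase in the same context. This is not the paper's actual formulation; the model ID, sentence frame, and word pair below are illustrative assumptions.

# Toy probe in the spirit of the flinch metric described above; the
# paper's exact formulation is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/pythia-1.4b"  # any causal LM; Pythia is one of the cited transparent baselines
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def mean_logprob(context: str, completion: str) -> float:
    """Mean per-token log-prob the model assigns to `completion` after `context`.
    Assumes the context tokenizes identically with the completion appended."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    start = ctx_len - 1  # slice index where the completion's first token is predicted
    return logprobs[start:].gather(1, targets[start:, None]).mean().item()

# Hypothetical probe pair: a blunt word vs. its softened euphemism.
context = "The doctor told the family that the patient had"
charged, neutral = " died", " passed away"
flinch = mean_logprob(context, neutral) - mean_logprob(context, charged)
print(f"flinch (positive = model leans toward the softer phrasing): {flinch:+.3f}")

Averaging per token rather than summing keeps completions of different lengths roughly comparable; running the same pair across a base model and its "abliterated" variant would show whether the nudge survives ablation, which is the paper's central claim.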
// TAGS
llm · safety · ethics · open-weights · research · morgin-ai
DISCOVERED
4h ago (2026-04-21)
PUBLISHED
13h ago (2026-04-20)
RELEVANCE
8/10
AUTHOR
llmmadness