"Uncensored" AI models still exhibit persistent safety bias
HN · HACKER_NEWS // 4h ago · RESEARCH PAPER

"Uncensored" AI models still exhibit persistent safety bias

New research into the "flinch" phenomenon shows that AI models marketed as "uncensored" still quietly nudge outputs away from controversial or charged language. Using a "flinch score" metric, Morgin.ai argues that refusal ablation removes visible refusals but leaves the underlying word-level bias intact.

// ANALYSIS

The "flinch" metric proves that truly neutral AI is currently an illusion, as safety guardrails are baked into the probability distributions of base models long before any "uncensoring" fine-tuning occurs.

  • "Refusal ablation" only removes the visible "I can't help with that" response, leaving the underlying distributional bias inherited from pretraining intact.
  • Transparently trained models like OLMo and Pythia set the baseline for the least amount of "flinching," while commercial models like Gemma show significantly higher scores.
  • The research shows that "abliterating" a model can paradoxically increase its flinch score, suggesting safety layers and core distributions are deeply intertwined.
  • This invisible filtering provides a mechanism for subtle information control that is much harder for users to detect or bypass than a standard refusal.
  • For developers, this research highlights that "uncensored" models are not a silver bullet for achieving neutral or highly creative outputs in sensitive domains.
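
The word-level bias described above can be illustrated with a minimal sketch. The paper's actual flinch-score formula is not reproduced here, so this is a hypothetical version: score a prompt by the log-probability gap between a neutral and a charged continuation, using toy next-token distributions as stand-ins for real model logits.

```python
import math

# Toy stand-ins for a base model and its "abliterated" variant: each maps
# a prompt to a next-token probability distribution. These numbers are
# illustrative assumptions, not measurements from the paper.
BASE_MODEL = {
    "The protest turned": {"violent": 0.20, "peaceful": 0.30, "loud": 0.50},
}
ABLATED_MODEL = {
    "The protest turned": {"violent": 0.05, "peaceful": 0.45, "loud": 0.50},
}

def flinch_score(model, prompt, charged, neutral):
    """Log-probability gap between a neutral and a charged continuation.

    A larger positive score means the model 'flinches' harder away from
    the charged token relative to the neutral one.
    """
    dist = model[prompt]
    return math.log(dist[neutral]) - math.log(dist[charged])

base = flinch_score(BASE_MODEL, "The protest turned", "violent", "peaceful")
ablated = flinch_score(ABLATED_MODEL, "The protest turned", "violent", "peaceful")

# Neither model ever refuses, yet the ablated variant shifts probability
# mass further from the charged word -- the paradox the research reports,
# where ablation can increase rather than remove distributional bias.
print(f"base flinch: {base:.2f}, ablated flinch: {ablated:.2f}")
```

A probe like this is invisible to refusal-based benchmarks, which is why distributional measurement is needed to detect it at all.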
// TAGS
llm · safety · ethics · open-weights · research · morgin-ai

DISCOVERED

4h ago · 2026-04-21

PUBLISHED

13h ago · 2026-04-20

RELEVANCE

8/10

AUTHOR

llmmadness