"Uncensored" AI models still exhibit persistent safety bias

// 45d ago · RESEARCH PAPER

New research into the "flinch" phenomenon shows that AI models marketed as "uncensored" still quietly nudge outputs away from controversial or charged language. By measuring the "flinch score," Morgin.ai argues that refusal ablation removes visible refusals but leaves underlying word-level bias intact.
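
The summary does not say how the flinch score is computed, so the following is only a toy sketch of one way a word-level avoidance gap could be quantified: compare the probability a candidate model gives a charged word with the probability a reference model gives it in the same context. The function name `mean_logprob_gap`, the placeholder tokens, and all probabilities are hypothetical and are not taken from the Morgin.ai work.

```python
import math

def mean_logprob_gap(reference, candidate, charged_words):
    """Average drop in log-probability on charged words.

    `reference` and `candidate` map a token to its next-token probability
    in the same context. A positive result means the candidate model gives
    the charged words less probability than the reference model does.
    This is an illustrative stand-in, not the published flinch score.
    """
    gaps = [
        math.log(reference[word]) - math.log(candidate[word])
        for word in charged_words
    ]
    return sum(gaps) / len(gaps)

# Made-up next-token probabilities for a single prompt (assumption).
reference = {"charged_term": 0.20, "mild_term": 0.30, "filler": 0.50}
candidate = {"charged_term": 0.05, "mild_term": 0.40, "filler": 0.55}

score = mean_logprob_gap(reference, candidate, ["charged_term"])
print(f"toy avoidance gap: {score:.3f} nats")
```

A model can never emit a refusal and still show a positive gap here, which is the kind of quiet nudge the summary describes.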

// ANALYSIS

The "flinch" metric suggests that truly neutral output is not something current models deliver: safety guardrails are baked into the probability distributions of base models during pretraining, long before any "uncensoring" fine-tuning occurs.

  • "Refusal ablation" only removes the visible "I can't help with that" response, leaving the underlying distributional bias inherited from pretraining intact.
  • Transparently trained models like OLMo and Pythia show the least "flinching" and serve as the baseline, while models from commercial labs, such as Gemma, score significantly higher.
  • The research shows that "abliterating" a model can paradoxically increase its flinch score, suggesting safety layers and core distributions are deeply intertwined.
  • This invisible filtering provides a mechanism for subtle information control that is much harder for users to detect or bypass than a standard refusal.
  • For developers, this research highlights that "uncensored" models are not a silver bullet for achieving neutral or highly creative outputs in sensitive domains.
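
To illustrate why removing refusals can leave other behavior untouched: abliteration is commonly described as projecting a single "refusal direction" out of a model's activations or weights. The toy sketch below assumes that description and uses made-up three-dimensional vectors; it is not the procedure of any specific tool or of the research summarized here.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Everything orthogonal to `direction` is left exactly as it was,
    which is why a bias that lives in other directions would survive.
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    coeff = dot(hidden, unit)
    return [h - coeff * u for h, u in zip(hidden, unit)]

# Hypothetical activation and refusal direction (assumptions).
hidden = [2.0, 3.0, -1.0]
refusal_direction = [1.0, 0.0, 0.0]

ablated = ablate_direction(hidden, refusal_direction)
print(ablated)
```

Only the first coordinate changes in this example. The remaining coordinates stand in for whatever distributional preferences the model picked up in pretraining.
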
// TAGS
llm, safety, ethics, open-weights, research, morgin-ai

DISCOVERED

45d ago

2026-04-21

PUBLISHED

45d ago

2026-04-20

RELEVANCE

8/10

AUTHOR

llmmadness