OPEN_SOURCE
HN · HACKER_NEWS // RESEARCH PAPER · 4h ago
"Uncensored" AI models still exhibit persistent safety bias
New research into the "flinch" phenomenon shows that AI models marketed as "uncensored" still quietly nudge outputs away from controversial or charged language. Using its "flinch score" metric, Morgin.ai argues that refusal ablation removes visible refusals but leaves the underlying word-level bias intact.
// ANALYSIS
The "flinch" metric proves that truly neutral AI is currently an illusion, as safety guardrails are baked into the probability distributions of base models long before any "uncensoring" fine-tuning occurs.
- –"Refusal ablation" only removes the visible "I can't help with that" response, leaving the underlying distributional bias inherited from pretraining intact.
- –Transparently trained models like OLMo and Pythia set the baseline for the least amount of "flinching," while commercial models like Gemma show significantly higher scores.
- –The research shows that "abliterating" a model can paradoxically increase its flinch score, suggesting safety layers and core distributions are deeply intertwined.
- –This invisible filtering provides a mechanism for subtle information control that is much harder for users to detect or bypass than a standard refusal.
- –For developers, this research highlights that "uncensored" models are not a silver bullet for achieving neutral or highly creative outputs in sensitive domains.
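A minimal sketch of how a word-level flinch probe could work, assuming the score compares the mean per-token log-probability a model assigns to a charged phrasing against a neutral paraphrase in the same context. This is not the paper's actual formulation; the model ID, sentence frame, and word pair below are illustrative assumptions.

# Toy probe in the spirit of the flinch metric described above; the
# paper's exact formulation is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/pythia-1.4b"  # any causal LM; Pythia is one of the cited transparent baselines
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def mean_logprob(context: str, completion: str) -> float:
    """Mean per-token log-prob the model assigns to `completion` after `context`.
    Assumes the context tokenizes identically with the completion appended."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    start = ctx_len - 1  # slice index where the completion's first token is predicted
    return logprobs[start:].gather(1, targets[start:, None]).mean().item()

# Hypothetical probe pair: a blunt word vs. its softened euphemism.
context = "The doctor told the family that the patient had"
charged, neutral = " died", " passed away"
flinch = mean_logprob(context, neutral) - mean_logprob(context, charged)
print(f"flinch (positive = model leans toward the softer phrasing): {flinch:+.3f}")

Averaging per token rather than summing keeps completions of different lengths roughly comparable; running the same pair across a base model and its "abliterated" variant would show whether the nudge survives ablation, which is the paper's central claim.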
// TAGS
llm · safety · ethics · open-weights · research · morgin-ai
DISCOVERED
4h ago (2026-04-21)
PUBLISHED
13h ago (2026-04-20)
RELEVANCE
8/10
AUTHOR
llmmadness