Alignment Tightens Models, Not Their Truthfulness
This preprint argues that post-training makes LLMs more decisive without making them more accurate or truthful. Across 3 architectures and 4 RL methods, the “commitment layer” stays fixed while internal representations compress more tightly around the decision point.
The sharp read here is that alignment is reshaping how models commit to answers, not whether they actually know the right ones. That matters because a model can sound safer and more assured after post-training while its truthfulness stays where it was.
- The paper’s main result is structural: RL methods tighten the lock-in point rather than moving it.
- If that holds broadly, confidence and compliance metrics are not enough to judge alignment quality.
- Developers should expect post-training to improve decisiveness, refusal style, and answer commitment before it improves epistemic reliability.
- The result reinforces a familiar warning: alignment can optimize behavior that looks better to users without improving underlying factuality.
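To make the distinction between where a model commits and how tightly it commits concrete, here is a minimal Python sketch. It is not the preprint's procedure: the helper names (`commitment_layer`, `margin`), the per-layer top-1 token lists, and every number are hypothetical and chosen only for illustration. It assumes you already have, for one prompt, the top-1 token at each layer and the probability distribution at the layer where the answer locks in.

```python
from typing import Sequence


def commitment_layer(top1_per_layer: Sequence[int]) -> int:
    """Earliest layer from which the top-1 token stays equal to the final answer."""
    if not top1_per_layer:
        raise ValueError("need at least one layer")
    final = top1_per_layer[-1]
    layer = len(top1_per_layer) - 1
    # Walk backwards while the preceding layer already agrees with the final answer.
    while layer > 0 and top1_per_layer[layer - 1] == final:
        layer -= 1
    return layer


def margin(probs: Sequence[float]) -> float:
    """Gap between the two largest probabilities; a larger gap is a tighter commitment."""
    top, runner_up = sorted(probs, reverse=True)[:2]
    return top - runner_up


# Toy, made-up values for a single prompt (NOT data from the paper).
# Top-1 token id per layer: both models settle on token 7 from layer 3 onward.
base_top1 = [2, 5, 2, 7, 7, 7]
tuned_top1 = [2, 5, 5, 7, 7, 7]

# Hypothetical probability distributions at the commitment layer.
base_probs_at_commit = [0.45, 0.35, 0.20]
tuned_probs_at_commit = [0.80, 0.15, 0.05]

correct_token = 9  # hypothetical ground truth: both models answer wrongly

base_layer = commitment_layer(base_top1)
tuned_layer = commitment_layer(tuned_top1)
base_margin = margin(base_probs_at_commit)
tuned_margin = margin(tuned_probs_at_commit)
base_correct = base_top1[-1] == correct_token
tuned_correct = tuned_top1[-1] == correct_token

print("commitment layer (base, tuned):", base_layer, tuned_layer)
print("margin at commitment (base, tuned):", round(base_margin, 2), round(tuned_margin, 2))
print("correct (base, tuned):", base_correct, tuned_correct)
```

In this toy case the lock-in layer is the same for both models and the answer is equally wrong, yet the post-trained model's margin is much larger. A confidence metric alone would score it as the better model, which is the gap the bullets above warn about.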
Discovered: 2026-04-27
Published: 2026-04-27
Author: 141_1337