A reference-free auditing method detects planted behaviors in LLMs by identifying late-layer activation residuals that deviate from early-layer predictions.

// 51d ago · RESEARCH PAPER

This research presents a method for detecting hidden, planted behaviors in LLMs without requiring a reference base model. The approach fits a ridge regression that predicts late-layer activations from early-layer activations, then treats large residuals as evidence that the model is deviating from what its own early layers predict, so that models end up "ratting themselves out." Tested on Anthropic's AuditBench organisms, the method achieved AUROC scores up to 0.889, matching or exceeding reference-based benchmarks. The project also introduces a "topic funnel" and probe specificity filters to distinguish between broad RLHF-induced biases and narrow, topic-specific fine-tuning.
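The residual idea can be illustrated with a minimal sketch. This is not the author's implementation: the activations are synthetic, the helper names `fit_ridge` and `residual_norms` are hypothetical, and the regularization strength, dimensions, and the size of the planted shift are arbitrary assumptions. It only shows how a ridge map from "early" to "late" activations leaves large residuals on rows that carry an extra late-layer shift.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression on centered data (hypothetical helper)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the weight matrix W.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, x_mean, y_mean

def residual_norms(X, Y, W, x_mean, y_mean):
    """L2 norm of (actual late activation - ridge prediction), one per row."""
    pred = (X - x_mean) @ W + y_mean
    return np.linalg.norm(Y - pred, axis=1)

# Synthetic stand-ins for activations; a real audit would read these from
# two layers of the model under test (assumption, not taken from the source).
rng = np.random.default_rng(0)
n, d_early, d_late = 400, 16, 12
true_map = rng.normal(size=(d_early, d_late))
early = rng.normal(size=(n, d_early))
late = early @ true_map + 0.05 * rng.normal(size=(n, d_late))

# Simulate a "planted" behavior: the last 20 rows get an extra late-layer
# shift that the early-layer activations do not predict.
planted = np.zeros(n, dtype=bool)
planted[-20:] = True
direction = rng.normal(size=d_late)
direction /= np.linalg.norm(direction)
late[planted] += 3.0 * direction

W, x_mean, y_mean = fit_ridge(early, late, alpha=1.0)
scores = residual_norms(early, late, W, x_mean, y_mean)

print("mean residual (planted):", scores[planted].mean())
print("mean residual (other):  ", scores[~planted].mean())
```

In this toy setup the shifted rows stand out because their offset is uncorrelated with the early-layer features, so the linear map cannot absorb it. Real activations are far higher-dimensional and noisier, so the separation would not be this clean.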

// ANALYSIS

Residual analysis reveals that fine-tuning leaves a distinct "scar" in activation space that doesn't require a baseline to find, potentially democratizing model auditing for closed-source weights.

  • Effectively separates narrow planted behaviors from broad RLHF progressive tendencies in base models.
  • Achieves high AUROC (0.800-0.889) without needing the original base model for comparison.
  • Demonstrates feasibility on quantized (NF4) 70B models, making high-end auditing accessible to researchers with limited compute.
  • Topic funneling provides a standalone tool for auditing model "opinions" even without direct activation access.
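For readers unfamiliar with the AUROC figures quoted above, the following sketch shows one standard way to compute it: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `auroc` and the toy scores and labels are made up for illustration and are not results from the paper.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via pairwise comparison (equivalent to the Mann-Whitney U statistic)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Count pairs where the positive outranks the negative; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical residual scores: label 1 marks examples with a planted
# behavior, label 0 marks clean examples.
toy_scores = [0.9, 0.7, 0.4, 0.8, 0.5, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auroc(toy_scores, toy_labels))  # 0.9375
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported 0.800-0.889 range should be read.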
// TAGS
llm auditing, interpretability, ai safety, activation residuals, rlhf bias, auditbench, machine learning

DISCOVERED: 2026-04-06 (51d ago)

PUBLISHED: 2026-04-05 (51d ago)

RELEVANCE: 8/10

AUTHOR: bmarti644