A reference-free auditing method detects planted behaviors in LLMs by identifying late-layer activation residuals that deviate from early-layer predictions.

// 97d ago · RESEARCH PAPER

This research presents a method for detecting hidden, planted behaviors in LLMs without requiring a reference base model. The approach fits a ridge regression that predicts late-layer activations from early-layer activations and then examines the residuals: large residuals mark where the late layers deviate from what the early layers predict, so that models end up effectively "ratting themselves out." Tested on Anthropic's AuditBench organisms, the method achieved AUROC scores up to 0.889, matching or exceeding reference-based benchmarks. The project also introduces a "topic funnel" and probe specificity filters to distinguish broad RLHF-induced biases from narrow, topic-specific fine-tuning.
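The residual idea described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the author's implementation: the activation shapes, the closed-form ridge solver, the use of the residual L2 norm as a score, and all names (`fit_ridge`, `residual_scores`, `auroc`, `make_acts`, `planted_shift`) are assumptions made for the example. In particular, the summary does not say which data the regression is fit on or how residuals are turned into a score.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def residual_scores(X, Y, W):
    """Per-example L2 norm of (actual late activations - predicted late activations)."""
    return np.linalg.norm(Y - X @ W, axis=1)

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic stand-ins for early- and late-layer activations (hypothetical sizes).
rng = np.random.default_rng(0)
d_early, d_late = 16, 8
true_map = rng.normal(size=(d_early, d_late))

def make_acts(n, shift=None):
    """Late activations are a noisy linear function of early ones, plus an optional shift."""
    early = rng.normal(size=(n, d_early))
    late = early @ true_map + 0.1 * rng.normal(size=(n, d_late))
    if shift is not None:
        late = late + shift  # assumption: a planted behavior shows up as a late-layer offset
    return early, late

X_fit, Y_fit = make_acts(500)                         # data used to fit the early-to-late map
X_clean, Y_clean = make_acts(100)                     # held-out ordinary inputs
planted_shift = 2.0 * np.ones(d_late)
X_trig, Y_trig = make_acts(100, shift=planted_shift)  # inputs that trigger the planted behavior

W = fit_ridge(X_fit, Y_fit, alpha=1.0)
scores = np.concatenate([
    residual_scores(X_clean, Y_clean, W),
    residual_scores(X_trig, Y_trig, W),
])
labels = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
detection_auroc = auroc(scores, labels)
print(f"AUROC of residual norm as a detector: {detection_auroc:.3f}")
```

The synthetic shift here is deliberately large, so the separation is far cleaner than anything reported for real models; the sketch only shows the shape of the pipeline (fit, predict, score residuals, evaluate with AUROC).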

// ANALYSIS

Residual analysis reveals that fine-tuning leaves a distinct "scar" in activation space that doesn't require a baseline to find, potentially democratizing model auditing for closed-source weights.

– Effectively separates narrow planted behaviors from broad RLHF-induced progressive tendencies in base models.
– Achieves high AUROC (0.800-0.889) without needing the original base model for comparison.
– Demonstrates feasibility on quantized (NF4) 70B models, making high-end auditing accessible to researchers with limited compute.
– Topic funneling provides a standalone tool for auditing model "opinions" even without direct activation access.

// TAGS

llm auditing, interpretability, ai safety, activation residuals, rlhf bias, auditbench, machine learning

DISCOVERED

97d ago

2026-04-06

PUBLISHED

97d ago

2026-04-05

RELEVANCE

8/10

AUTHOR

bmarti644
