A reference-free auditing method detects planted behaviors in LLMs by identifying late-layer activation residuals that deviate from early-layer predictions.
This research presents a method for detecting hidden, planted behaviors in LLMs without requiring a reference base model. A ridge regression predicts late-layer activations from early-layer activations, and the residuals show where the late layers depart from what the early layers predict, so that models are effectively "ratting themselves out." Tested on Anthropic's AuditBench organisms, the method achieved AUROC scores up to 0.889, matching or exceeding reference-based benchmarks. The project also introduces a "topic funnel" and probe specificity filters to distinguish between broad RLHF-induced biases and narrow, topic-specific fine-tuning.
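The residual idea can be illustrated with a minimal sketch on synthetic data. This is not the author's implementation: the closed-form ridge solver, the function names `fit_ridge` and `residual_scores`, the regularization strength `alpha`, the choice to fit on all examples at once, and the synthetic "planted" offset are all assumptions made for illustration. Real use would substitute activations extracted from two layers of a model.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha*I)^-1 X^T Y.
    Inputs are assumed to be mean-centered, so no intercept is fitted."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def residual_scores(early, late, alpha=1.0):
    """Norm of the part of each late-layer activation that a linear map
    from the early-layer activation fails to explain."""
    Xc = early - early.mean(axis=0)
    Yc = late - late.mean(axis=0)
    W = fit_ridge(Xc, Yc, alpha)
    residual = Yc - Xc @ W
    return np.linalg.norm(residual, axis=1)

# Synthetic stand-in for activations (hypothetical sizes, not from the source).
rng = np.random.default_rng(0)
n, d_early, d_late = 200, 16, 16
early = rng.normal(size=(n, d_early))
true_map = rng.normal(size=(d_early, d_late))
late = early @ true_map + 0.01 * rng.normal(size=(n, d_late))

# Simulate a planted behavior: a fixed late-layer offset on 10% of examples
# that the early layer does not predict.
planted = np.zeros(n, dtype=bool)
planted[:20] = True
late[planted] += 3.0 * rng.normal(size=d_late)

scores = residual_scores(early, late)
print("mean residual, planted:", scores[planted].mean())
print("mean residual, clean:  ", scores[~planted].mean())
```

In this toy setup the planted examples end up with larger residual norms because their late-layer shift is not linearly predictable from the early layer. Fitting and scoring on the same examples is a simplification; a held-out split would give a cleaner estimate.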
Residual analysis reveals that fine-tuning leaves a distinct "scar" in activation space that doesn't require a baseline to find, potentially democratizing model auditing for closed-source weights.
- Effectively separates narrow planted behaviors from broad RLHF progressive tendencies in base models.
- Achieves high AUROC (0.800-0.889) without needing the original base model for comparison.
- Demonstrates feasibility on quantized (NF4) 70B models, making high-end auditing accessible to researchers with limited compute.
- Topic funneling provides a standalone tool for auditing model "opinions" even without direct activation access.
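For readers unfamiliar with the AUROC figures quoted above, the metric is the probability that a randomly chosen positive example (here, one with a planted behavior) receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition; the function name `auroc` and the toy scores are illustrative assumptions, not data from the project.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via pairwise comparison: the fraction of (positive, negative)
    pairs where the positive scores higher, counting ties as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy example: three of the four positive/negative pairs are ranked correctly.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
print(auroc(toy_scores, toy_labels))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported range of 0.800 to 0.889 indicates that planted examples usually, but not always, outrank clean ones.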
Discovered: 2026-04-06
Published: 2026-04-05
Author: bmarti644