AREW breaks self-locking in LLM agents
Researchers from CUHK, UCSD, Georgia Tech, and ByteDance identify "information self-locking", a failure mode in which RL-trained agents stop asking useful questions and fail to integrate the answers they receive. Their fix is Advantage Reweighting (AREW), a lightweight plug-in that adds binary step-level critiques to standard policy gradients. The technique improves results by up to 62 percentage points on active reasoning benchmarks without redesigning the reward structure.
AREW is one of those rare RL fixes that is both theoretically clean and empirically decisive: a 62-point swing on PE-G isn't noise, it's a regime change in what RL-trained agents can actually do.
- Identifies a genuine failure loop: weak action selection → uninformative queries → weak belief tracking → even weaker queries. AREW injects directional feedback at the step level to break the deadlock.
- Works as an additive shaping term on top of any policy gradient algorithm (PPO, GRPO, etc.), with no reward redesign, no architecture changes, and minimal integration cost.
- Binary critiques (did this query reveal new information?) are cheap to obtain from the environment, which makes the method practical for real deployments.
- Results hold across 27 of 28 evaluated settings spanning medical diagnosis, preference estimation, and troubleshooting dialogue, a signal of broad applicability.
- No code has been released yet, but the method is simple enough that practitioners can implement it from the paper alone.
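Since no official code is available, the following is only a minimal sketch of what an additive, step-level shaping term driven by binary critiques could look like. It is not the paper's formulation: the function names `binary_critique` and `reweight_advantages`, the `weight` coefficient, the candidate-count test for "new information", and the mapping of critiques from {0, 1} to {-1, +1} are all assumptions made for illustration.

```python
from typing import List


def binary_critique(candidates_before: int, candidates_after: int) -> int:
    """Hypothetical critique: 1 if the query narrowed the set of remaining
    hypotheses (i.e. revealed new information), else 0."""
    return 1 if candidates_after < candidates_before else 0


def reweight_advantages(
    advantage: float, critiques: List[int], weight: float = 0.5
) -> List[float]:
    """Assumed shaping rule: add a signed per-step bonus to a single
    trajectory-level advantage (as produced by PPO/GRPO-style estimators)."""
    if any(c not in (0, 1) for c in critiques):
        raise ValueError("critiques must be binary (0 or 1)")
    # Map {0, 1} -> {-1, +1} so uninformative steps are pushed down
    # and informative steps are pushed up, relative to the base advantage.
    return [advantage + weight * (2 * c - 1) for c in critiques]


# Three dialogue turns: hypothesis counts before and after each query.
hypothesis_counts = [(8, 4), (4, 4), (4, 1)]
critiques = [binary_critique(before, after) for before, after in hypothesis_counts]
step_advantages = reweight_advantages(advantage=0.2, critiques=critiques)
```

In this sketch the second turn, which leaves the hypothesis count unchanged, receives a lower per-step advantage than the two turns that narrow it, even though all three share the same trajectory-level outcome. The per-step values would then replace the uniform advantage in the usual policy gradient loss.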
DISCOVERED
2026-03-15
PUBLISHED
2026-03-15
AUTHOR
Discover AI