REDDIT // 3h ago · RESEARCH PAPER

Representation over Routing fixes PPO collapse

This preprint argues that multi-timescale advantage routing in PPO can collapse because the router learns to game the surrogate loss or drifts into myopic weighting. The accompanying PyTorch minimal reproducible example (MRE) shows a simple target-decoupling fix: keep multi-timescale signals on the critic, but update the actor only with the long-term advantage.
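The target-decoupling idea can be sketched without any framework. The block below is an illustration under stated assumptions, not the preprint's code: the function names (`gae`, `clipped_surrogate`, `decoupled_losses`), the use of GAE with `lam=0.95`, and the choice of the largest discount factor as "the long-term advantage" are all hypothetical. It shows the critic loss summing over every timescale head while the actor's clipped surrogate reads only the long-horizon advantage.

```python
def gae(rewards, values, gamma, lam=0.95):
    """Generalized advantage estimates; `values` has len(rewards) + 1 entries."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single timestep."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def decoupled_losses(rewards, values_by_gamma, ratios, eps=0.2):
    """Return (actor_loss, critic_loss) for one episode.

    values_by_gamma maps each discount factor to that critic head's value
    predictions. Assumption: the largest gamma is the long-term head.
    """
    advantages = {g: gae(rewards, v, g) for g, v in values_by_gamma.items()}

    # Critic: every head regresses toward its own target (value + advantage),
    # so the squared error per step is simply advantage ** 2.
    critic_loss = sum(
        sum(a * a for a in adv) / len(adv) for adv in advantages.values()
    )

    # Actor: only the long-horizon advantage enters the surrogate, so the
    # short-horizon heads cannot influence the policy update.
    long_adv = advantages[max(values_by_gamma)]
    actor_loss = -sum(
        clipped_surrogate(r, a, eps) for r, a in zip(ratios, long_adv)
    ) / len(long_adv)
    return actor_loss, critic_loss
```

In this sketch, changing the predictions of a short-horizon head alters the critic loss but leaves the actor loss untouched, which is the separation the summary describes. In an autograd framework the same effect would typically also require treating the advantage as a constant in the actor loss.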

// ANALYSIS

The interesting part here is the diagnosis: the routing mechanism itself becomes an optimization target, so the policy learns to exploit the loss rather than improve control. Decoupling representation learning from action selection is the cleaner move.

– Exposing temporal routing weights to policy gradients creates a shortcut for surrogate-objective hacking.
– Gradient-free variance weighting favors low-variance short horizons, which explains the hovering, reward-hoarding behavior in LunarLander-v2.
– Keeping multi-timescale heads on the critic preserves auxiliary representation learning without letting the actor manipulate the router.
– The 4-stage MRE is valuable because it makes the failure modes and the fix reproducible in a small, inspectable PyTorch codebase.
– If the results generalize, this is a good cautionary example for any RL system that tries to route credit across horizons inside the policy path.
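The myopic-drift point in the list above can be illustrated with a small stdlib-only experiment. This is a sketch under assumptions: the summary does not say how the variance weighting is computed, so inverse-variance weighting is used here as one plausible reading, and the helper names, the three discount factors, and the synthetic Gaussian rewards are all hypothetical. Because the variance of a discounted return grows with the discount factor, the shortest horizon ends up with the largest weight.

```python
import random
import statistics


def discounted_return(rewards, gamma):
    """Discounted sum of rewards from the first step of an episode."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def inverse_variance_weights(episodes, gammas):
    """Weight each horizon by 1 / Var(return), normalized to sum to 1.

    No gradients are involved: the weights depend only on return statistics.
    """
    variances = {
        g: statistics.pvariance([discounted_return(ep, g) for ep in episodes])
        for g in gammas
    }
    inverse = {g: 1.0 / (v + 1e-8) for g, v in variances.items()}
    norm = sum(inverse.values())
    return {g: w / norm for g, w in inverse.items()}


# Synthetic data: 500 episodes of 50 steps with unit-variance noisy rewards.
rng = random.Random(0)
episodes = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(500)]
weights = inverse_variance_weights(episodes, gammas=(0.5, 0.9, 0.99))
```

Under this weighting rule the long-horizon signal is down-weighted precisely because it is noisier, which is consistent with the short-sighted behavior the list describes.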

// TAGS

representation-over-routing · research · open-source · agent

DISCOVERED

3h ago

2026-04-16

PUBLISHED

20h ago

2026-04-16

RELEVANCE

8/10

AUTHOR

dlwlrma_22