De Freitas proposes causal interactive training
Microsoft AI VP Nando de Freitas proposes a unified training framework for AI agents based on continual, causal interaction streams rather than multi-stage fine-tuning pipelines. By treating world-written tokens as evidence and self-written tokens as interventions, the method achieves competitive reasoning performance with a simpler, single-stream objective.
The proposed framework challenges the complex, multi-stage training recipes of modern LLMs in favor of a single, theoretically grounded interaction stream.
- Multi-stage pipelines (SFT, RLHF, GRPO) are criticized as a research local minimum that lacks clean mathematical semantics for interaction histories.
- By distinguishing between evidence (world-written tokens) and interventions (agent-written tokens), the model simplifies training using a loss mask.
- A STEM reasoning experiment shows the causal agent matches the performance of complex reinforcement learning methods like GRPO.
- The approach draws on universal artificial intelligence as imitation, shifting agent goals from reward maximization to action prediction.
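
To make the loss-mask idea concrete, here is a minimal sketch of a masked next-token objective over a single interaction stream. It assumes the mask keeps the loss on agent-written tokens (interventions) and only conditions on world-written tokens (evidence); the summary does not say which side is masked, so that choice, the function name `masked_nll`, and the toy numbers are all illustrative assumptions rather than the author's implementation.

```python
import math


def masked_nll(token_logprobs, is_agent_token):
    """Mean negative log-likelihood over agent-written positions only.

    token_logprobs: log-probability the model gave each token in the stream.
    is_agent_token: True where the agent wrote the token (assumed to be
                    the positions that receive loss), False where the
                    world wrote it (context only).
    """
    if len(token_logprobs) != len(is_agent_token):
        raise ValueError("logprobs and mask must have the same length")
    kept = [lp for lp, keep in zip(token_logprobs, is_agent_token) if keep]
    if not kept:
        return 0.0  # no agent-written tokens, so nothing to train on
    return -sum(kept) / len(kept)


# Hypothetical four-token stream: world, agent, world, agent.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
mask = [False, True, False, True]
loss = masked_nll(logprobs, mask)
```

Under this assumption the world-written tokens still shape the prediction through the context, but only the agent's own tokens contribute gradient, which matches the summary's framing of action prediction as the training goal.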
DISCOVERED
2026-06-25
PUBLISHED
2026-06-07
AUTHOR
NandoDF