GLM-5.2 adopts critic-based PPO training
With the open-weights model GLM-5.2, Zhipu AI returns to critic-based PPO training instead of group-relative baselines such as GRPO. The critic estimates a value at each token position, so advantages are computed per token within an individual rollout rather than by comparing whole rollouts against each other. This better supports compaction in long-horizon agentic tasks and 1M-token context windows.
Much of the industry recently moved to group-based reinforcement learning (such as GRPO) to avoid the compute overhead of training a separate critic model. Zhipu's return to critic-based PPO suggests that group-wise variance reduction scales poorly to ultra-long contexts and complex, multi-turn reasoning.
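For context, the group-based approach can be sketched as follows. This is a minimal illustration of a group-relative advantage in the style of GRPO, not Zhipu's or any library's implementation; the function name `group_relative_advantages` and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's scalar reward against its sampling group.

    rewards: one scalar reward per rollout, all sampled from the same prompt.
    Returns one advantage per rollout; no critic model is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts with binary outcome rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
group_advs = group_relative_advantages(group_rewards)
```

Each rollout gets a single scalar here, and every token in that rollout shares it. The signal depends on having several comparable rollouts for the same prompt, which is the property the summary argues becomes a limitation for very long traces.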
- The Critic's Return: Using a dedicated critic allows the model to compute token-level advantages, which is crucial when tracing complex, long-horizon decision trees.
- Compaction Support: Group-wise methods struggle with comparing relative outputs when traces are compacted; the critic-based setup natively supports compacted sub-traces.
- Agentic Specialization: This structural change positions GLM-5.2 as a highly capable model for project-scale engineering and multi-step agent actions.
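To make "token-level advantages" concrete, here is a minimal sketch using generalized advantage estimation (GAE), the estimator commonly paired with PPO. Whether GLM-5.2 uses GAE specifically is an assumption; the function name `gae_advantages`, the hyperparameters, and the example values are hypothetical.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages for a single rollout from critic value estimates.

    rewards: reward received at each of the T token positions.
    values:  critic estimates V(s_t) for t = 0..T; the final entry is the
             bootstrap value after the last token (0.0 for a finished episode).
    """
    num_tokens = len(rewards)
    advantages = [0.0] * num_tokens
    running = 0.0
    # Walk backwards so each position accumulates discounted future TD errors.
    for t in reversed(range(num_tokens)):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical 4-token rollout with a single terminal reward.
token_rewards = [0.0, 0.0, 0.0, 1.0]
critic_values = [0.2, 0.4, 0.3, 0.8, 0.0]
token_advs = gae_advantages(token_rewards, critic_values)
```

Unlike a group-relative baseline, this needs only one rollout plus the critic's estimates, and each token gets its own advantage. That is consistent with the claim that a critic can score a sub-trace on its own, without sibling rollouts to compare against.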
DISCOVERED
2026-06-17
PUBLISHED
2026-06-17
AUTHOR
jeremyphoward