CogArch trains LLMs via competitive self-play
CogArch is an open-source self-improvement framework where two LLMs compete to solve coding problems, using unit test execution to generate DPO training pairs for verifiable alignment without human labels.
CogArch suggests that verifiable rewards (here, code execution) can drive model improvement without a human in the loop, echoing the verifiable-reward training associated with reasoning models such as o1 and DeepSeek-R1.
- Replacing the standard "judge model" with raw execution results removes judge-model bias from the reward, although the signal is only as trustworthy as the unit tests behind it.
- Using DPO rather than PPO or GRPO keeps the training loop stable and affordable on local hardware, since DPO trains offline on fixed preference pairs and needs no separate reward model.
- A memory system lets agents retrieve lessons from past failures, such as off-by-one errors, before their first attempt at a new problem.
- Multiple specialist agents sampling at different temperatures produce diverse solutions, which matters because a preference pair needs both a stronger and a weaker attempt at the same problem.
- An early result of +1.2pp on HumanEval from just 39 training pairs points to good sample efficiency, though on a 164-problem benchmark that gain amounts to roughly two additional problems solved.
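The execution-based reward and pair construction described in the points above can be sketched roughly as follows. This is a minimal illustration, not CogArch's actual code: the function names `passes_tests` and `build_pairs`, the assert-style test format, and the all-passing-versus-all-failing pairing rule are assumptions made for the example. The `prompt`/`chosen`/`rejected` record layout is the shape commonly used by DPO tooling.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Hypothetical helper: run a candidate plus assert-based tests in a subprocess.

    A zero exit code counts as a pass; a failed assertion, a crash,
    or a timeout counts as a failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


def build_pairs(prompt: str, candidates: list[str], tests: str) -> list[dict]:
    """Hypothetical helper: pair every passing candidate with every failing one."""
    verdicts = [(code, passes_tests(code, tests)) for code in candidates]
    passed = [code for code, ok in verdicts if ok]
    failed = [code for code, ok in verdicts if not ok]
    # No pair is produced when all candidates pass or all fail,
    # which is why diversity among the generated solutions matters.
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in passed
        for bad in failed
    ]
```

Given one correct and one buggy candidate for the same problem, `build_pairs` yields a single record with the correct solution as `chosen`. Running untrusted model output in a bare subprocess is only acceptable for a sketch; a real pipeline would want a proper sandbox.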
DISCOVERED: 2026-04-16
PUBLISHED: 2026-04-16
AUTHOR: Outrageous_Mark9761