OPEN_SOURCE
REDDIT · 3h ago · OPEN-SOURCE RELEASE
CogArch trains LLMs via competitive self-play
CogArch is an open-source self-improvement framework where two LLMs compete to solve coding problems, using unit test execution to generate DPO training pairs for verifiable alignment without human labels.
// ANALYSIS
CogArch suggests that verifiable rewards (here, code execution) can drive model improvement without a human in the loop, an approach similar to the verifiable-reward training reported for reasoning models such as DeepSeek-R1 and, reportedly, o1.
- Replacing the standard "judge model" with raw execution results removes judge-model bias from the reward signal, which is then only as reliable as the unit tests' coverage.
- DPO trains offline on fixed preference pairs, which tends to make the loop more stable and cheaper than PPO or GRPO, and accessible to developers with local hardware.
- A memory system lets agents retrieve past mistakes, such as off-by-one errors, before their first attempt at a new problem.
- Multiple specialist agents sampling at different temperatures increase the diversity of generated solutions, which is critical for building informative preference pairs.
- Early results show a +1.2pp gain on HumanEval from just 39 training pairs. On the 164-problem benchmark that is roughly two additional problems solved, so it is an early signal of sample efficiency rather than a conclusive result.
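
The execution-based pairing described above can be sketched roughly as follows. This is a minimal illustration, not CogArch's actual code: the function names `passes_tests` and `build_dpo_pairs` are hypothetical, the tests are assumed to be plain `assert` statements, and the `prompt`/`chosen`/`rejected` layout is an assumption based on the format commonly used for DPO datasets.

```python
import subprocess
import sys
import tempfile
from itertools import product
from pathlib import Path


def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run a candidate solution plus its assert-based tests in a fresh interpreter.

    Hypothetical helper: a real system should sandbox untrusted model output.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hanging solution counts as a failure.
            return False
        # Exit code 0 means no assertion or runtime error was raised.
        return result.returncode == 0


def build_dpo_pairs(prompt: str, candidates: list[str], tests: str) -> list[dict]:
    """Pair every passing candidate (chosen) with every failing one (rejected)."""
    results = [(c, passes_tests(c, tests)) for c in candidates]
    passed = [c for c, ok in results if ok]
    failed = [c for c, ok in results if not ok]
    # No pairs are produced when all candidates pass or all fail.
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in product(passed, failed)
    ]
```

Under these assumptions, a problem yields training data only when the competing models disagree on correctness, which is one reason solution diversity matters for this kind of loop.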
// TAGS
cogarch · ai-coding · llm · fine-tuning · agent · open-source · reasoning
DISCOVERED
3h ago
2026-04-16
PUBLISHED
17h ago
2026-04-16
RELEVANCE
8/10
AUTHOR
Outrageous_Mark9761