CogArch trains LLMs via competitive self-play
REDDIT // 3h ago · OPEN-SOURCE RELEASE


CogArch is an open-source self-improvement framework in which two LLMs compete to solve coding problems; unit-test execution results are used to generate DPO training pairs, enabling verifiable alignment without human labels.
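The loop described above — two competing solutions, graded by test execution, distilled into a preference pair — can be sketched roughly as follows. The names `solve`, `run_tests`, and `make_dpo_pair` are illustrative assumptions, not CogArch's actual API:

```python
def run_tests(solution_src: str, tests: list) -> int:
    """Count passing unit tests for one candidate solution.
    Assumes the candidate defines a function named `solve` (a simplification;
    a real harness would sandbox execution)."""
    ns: dict = {}
    try:
        exec(solution_src, ns)  # load the candidate into an isolated namespace
    except Exception:
        return 0  # a solution that fails to even load passes zero tests
    passed = 0
    for args, expected in tests:
        try:
            if ns["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count as passed
    return passed

def make_dpo_pair(prompt: str, sol_a: str, sol_b: str, tests: list):
    """Label the higher-scoring competitor 'chosen'; ties carry no signal."""
    score_a, score_b = run_tests(sol_a, tests), run_tests(sol_b, tests)
    if score_a == score_b:
        return None
    chosen, rejected = (sol_a, sol_b) if score_a > score_b else (sol_b, sol_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Discarding ties is the key design choice: a pair only enters the DPO dataset when execution produces an unambiguous winner.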

// ANALYSIS

CogArch demonstrates that verifiable rewards (code execution) can drive model improvement without a human in the loop, mirroring the techniques used by top-tier reasoning models such as o1 and DeepSeek-R1.
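A minimal version of execution-as-reward runs each candidate in a separate interpreter with a timeout, so an untrusted solution cannot hang or crash the training loop. This is a sketch under that assumption (the function name is hypothetical, and real systems use far stronger sandboxing):

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution_src: str, test_src: str, timeout: float = 5.0) -> bool:
    """Execute a candidate plus its asserts in a child interpreter so that
    infinite loops and hard crashes cannot take down the trainer."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_src + "\n\n" + test_src + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0  # any failed assert or exception -> nonzero
    except subprocess.TimeoutExpired:
        return False  # a hung solution is treated as a failure
    finally:
        os.remove(path)
```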

  • Replacing the standard "judge model" with raw execution results eliminates model bias and ensures a ground-truth reward signal.
  • The use of DPO instead of PPO or GRPO makes the training loop stable and computationally accessible for developers with local hardware.
  • A sophisticated memory system allows agents to retrieve and learn from past errors, such as off-by-one errors, before their first attempt at a new problem.
  • Multi-specialist agents with varying temperatures ensure high diversity in generated solutions, which is critical for creating high-quality preference pairs.
  • Early results showing a +1.2pp gain on HumanEval from just 39 training pairs highlight the high sample efficiency of this competitive approach.
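For reference, the per-pair DPO objective that such preference pairs feed into, written with scalar sequence log-probabilities (the variable names and `beta` are generic, not CogArch settings):

```python
import math

def dpo_loss(beta: float,
             pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin)), where the margin
    measures how much more the policy (relative to the frozen reference model)
    prefers the chosen completion over the rejected one."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization the policy equals the reference, so the margin is zero and the loss is log 2; it falls as the policy learns to rank the test-passing solution higher. Because this is a simple supervised-style loss over offline pairs, it avoids the rollout machinery PPO and GRPO require, which is the stability and accessibility point made above.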
// TAGS
cogarch · ai-coding · llm · fine-tuning · agent · open-source · reasoning

DISCOVERED

3h ago

2026-04-16

PUBLISHED

17h ago

2026-04-16

RELEVANCE

8/10

AUTHOR

Outrageous_Mark9761