OPEN_SOURCE ↗
REDDIT · REDDIT// 37d agoRESEARCH PAPER
OpenAI unveils CoT-Control safety eval
OpenAI published a new research paper and eval suite called CoT-Control to measure whether reasoning models can deliberately shape or suppress their chain-of-thought traces. The early result is encouraging for AI safety: current reasoning models appear much worse at controlling their internal reasoning than at controlling final answers, so CoT monitoring still looks useful as an oversight signal.
// ANALYSIS
This is one of the more important safety papers of the week because it turns a vague fear—models hiding their reasoning—into a concrete benchmark with surprisingly low current scores.
- –CoT-Control pairs real benchmark tasks with explicit constraints on the reasoning trace, such as avoiding key words or changing style, to test whether a model can solve the task while masking how it thinks.
- –The paper covers more than 14,000 tasks drawn from benchmarks like GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified, which makes it much more substantive than a toy safety demo.
- –OpenAI says GPT-5.4 Thinking controls only about 0.3% of 10k-character chains of thought, close to GPT-5.2 Thinking at 0.2%, suggesting better reasoning did not obviously make hidden reasoning easier.
- –The broader research result is more nuanced: controllability rises with model size, but falls with more RL reasoning training, more test-time compute, and harder tasks, so this is not a permanently solved problem.
- –For developers building agent systems, the practical takeaway is that reasoning traces remain a useful monitoring surface today, but labs now need to track this metric continuously as models get more situationally aware.
// TAGS
cot-controlreasoningsafetybenchmarkresearch
DISCOVERED
37d ago
2026-03-06
PUBLISHED
37d ago
2026-03-05
RELEVANCE
9/ 10
AUTHOR
likeastar20