OpenAI unveils CoT-Control safety eval

// 83d agoRESEARCH PAPER

OpenAI unveils CoT-Control safety eval

OpenAI published a new research paper and eval suite called CoT-Control to measure whether reasoning models can deliberately shape or suppress their chain-of-thought traces. The early result is encouraging for AI safety: current reasoning models appear much worse at controlling their internal reasoning than at controlling final answers, so CoT monitoring still looks useful as an oversight signal.

// ANALYSIS

This is one of the more important safety papers of the week because it turns a vague fear—models hiding their reasoning—into a concrete benchmark with surprisingly low current scores.

–CoT-Control pairs real benchmark tasks with explicit constraints on the reasoning trace, such as avoiding key words or changing style, to test whether a model can solve the task while masking how it thinks.
–The paper covers more than 14,000 tasks drawn from benchmarks like GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified, which makes it much more substantive than a toy safety demo.
–OpenAI says GPT-5.4 Thinking controls only about 0.3% of 10k-character chains of thought, close to GPT-5.2 Thinking at 0.2%, suggesting better reasoning did not obviously make hidden reasoning easier.
–The broader research result is more nuanced: controllability rises with model size, but falls with more RL reasoning training, more test-time compute, and harder tasks, so this is not a permanently solved problem.
–For developers building agent systems, the practical takeaway is that reasoning traces remain a useful monitoring surface today, but labs now need to track this metric continuously as models get more situationally aware.

// TAGS

cot-controlreasoningsafetybenchmarkresearch

DISCOVERED

83d ago

2026-03-06

PUBLISHED

84d ago

2026-03-05

RELEVANCE

9/ 10

AUTHOR

likeastar20

// KEEP READING

More AI developer news from the feed

EXPLORE FULL FEED

MODEL1h ago

Anthropic drops Opus 4.8 for Claude Code

Anthropic has released Opus 4.8, integrating the new model into Claude Code with high-effort defaults for complex coding tasks. The update boosts SWE-bench Pro scores to 69.2% and drastically reduces unremarked flaws in generated code.

VIDEO1h ago

Google AI animates cardboard TPUs for I/O 2026

Google AI partners with director Laurie Rowan and Nexus Studios to create a promotional short film for Google I/O 2026. The project leverages AI models to animate physical materials like cardboard and markers into characters representing Tensor Processing Units.

MODEL1h ago

Claude Opus 4.8 drops with extended agentic autonomy

Anthropic has released Claude Opus 4.8, bringing improvements to agentic skills, reasoning, and coding capabilities at the exact same price. The update introduces sharper judgment, increased honesty about its task progress, and the ability to operate autonomously for much longer periods.