OPEN_SOURCE
REDDIT // 1h ago // BENCHMARK RESULT
Claude Opus 4.7 trails 4.6 on benchmark
On the Thematic Generalization Benchmark's hard subset, Claude Opus 4.7 (high reasoning) scores 72.8 inverse-rank, behind Opus 4.6's 80.6. The no-reasoning run falls further to 52.6, suggesting the model still struggles when a task depends on preserving a narrow conjunction rather than matching a broad theme.
// ANALYSIS
This looks like a real regression in constraint retention, not just a noisy eval blip.
- The benchmark is designed to punish broad matches with anti-examples, and 4.7 still gets pulled toward the wrong generalization.
- High reasoning does not close the gap to 4.6 here; the published hard-subset numbers put 4.7 behind its predecessor on the same 703-case slice.
- The no-reasoning variant dropping to 52.6 suggests the model is highly sensitive to whether it can sustain explicit deliberation.
- For developers, this is a reminder that "better model" is not a single scalar; narrow thematic inference can move opposite to coding or vision gains.
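
To make the size of the two gaps concrete, here is a minimal sketch that only does arithmetic on the three hard-subset scores quoted above. The dictionary keys and the `gap` helper are hypothetical names chosen for illustration; nothing here reproduces the benchmark's own scoring code.

```python
# Hard-subset inverse-rank scores as quoted in the summary (higher is better).
# Keys are illustrative labels, not official model identifiers.
scores = {
    "opus-4.6": 80.6,
    "opus-4.7-high-reasoning": 72.8,
    "opus-4.7-no-reasoning": 52.6,
}


def gap(better: str, worse: str) -> float:
    """Point difference between two quoted scores, rounded to one decimal."""
    return round(scores[better] - scores[worse], 1)


# How far 4.7 (high reasoning) sits behind its predecessor.
regression_vs_predecessor = gap("opus-4.6", "opus-4.7-high-reasoning")

# How much 4.7 loses when reasoning is switched off.
reasoning_dependence = gap("opus-4.7-high-reasoning", "opus-4.7-no-reasoning")

print(regression_vs_predecessor)  # 7.8
print(reasoning_dependence)       # 20.2
```

On these numbers, turning reasoning off costs 4.7 roughly two and a half times as many points as the gap to 4.6, which is consistent with the sensitivity to deliberation noted in the analysis.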
// TAGS
benchmark, reasoning, llm, research, claude-opus-4-7
DISCOVERED: 1h ago (2026-04-17)
PUBLISHED: 5h ago (2026-04-17)
RELEVANCE: 9/10
AUTHOR: zero0_one1