Claude Opus 4.7 trails 4.6 on benchmark
REDDIT // 1h ago · BENCHMARK RESULT

On the hard subset of the Thematic Generalization Benchmark, Claude Opus 4.7 (high reasoning) scores 72.8 on the inverse-rank metric, 7.8 points behind Opus 4.6's 80.6. The no-reasoning run falls further to 52.6, suggesting the model still struggles when a task depends on preserving a narrow conjunction of constraints rather than matching a broad theme.

// ANALYSIS

This looks like a real regression in constraint retention, not just a noisy eval blip.

  • The benchmark is designed to punish broad matches with anti-examples, and 4.7 still gets pulled toward the wrong generalization.
  • High reasoning does not close the gap to 4.6 here; the published hard-subset numbers put 4.7 behind its predecessor on the same 703-case slice.
  • The no-reasoning variant dropping to 52.6 suggests the model is highly sensitive to whether it can sustain explicit deliberation.
  • For developers, this is a reminder that "better model" is not a single scalar; narrow thematic inference can move opposite to coding or vision gains.
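
To make the size of the two gaps concrete, here is a minimal sketch that only does arithmetic on the three hard-subset scores quoted above. The dictionary keys and the `gap` helper are illustrative names, not part of the benchmark's tooling, and the sketch makes no assumption about how inverse-rank itself is computed.

```python
# Hard-subset scores as quoted in this post (inverse-rank, higher is better).
# Key names are illustrative labels, not identifiers from the benchmark.
scores = {
    "opus_4_6": 80.6,
    "opus_4_7_high_reasoning": 72.8,
    "opus_4_7_no_reasoning": 52.6,
}


def gap(a: str, b: str) -> float:
    """Return scores[a] - scores[b], rounded to one decimal place."""
    return round(scores[a] - scores[b], 1)


# How far 4.7 (high reasoning) sits behind 4.6.
regression_vs_4_6 = gap("opus_4_6", "opus_4_7_high_reasoning")

# How much 4.7 loses when reasoning is switched off.
reasoning_drop = gap("opus_4_7_high_reasoning", "opus_4_7_no_reasoning")

print(f"4.7 (high reasoning) vs 4.6: -{regression_vs_4_6} points")
print(f"4.7 no-reasoning vs high reasoning: -{reasoning_drop} points")
```

On these numbers, the drop from turning reasoning off is more than twice the gap to 4.6, which is consistent with the sensitivity-to-deliberation point above.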
// TAGS
benchmark · reasoning · llm · research · claude-opus-4-7

DISCOVERED

1h ago

2026-04-17

PUBLISHED

5h ago

2026-04-17

RELEVANCE

9/10

AUTHOR

zero0_one1