Claude Opus 4.7 trails 4.6 on benchmark

// BENCHMARK RESULT

On the Thematic Generalization Benchmark's hard subset, Claude Opus 4.7 (high reasoning) scores 72.8 inverse-rank, behind Opus 4.6's 80.6. The no-reasoning run falls further to 52.6, suggesting the model still struggles when a task depends on preserving a narrow conjunction rather than matching a broad theme.
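For readers unfamiliar with inverse-rank scoring, here is a minimal sketch of how such a number could be computed. It assumes the score is the mean of 1/rank of the correct item across cases, scaled to 0-100; the benchmark's actual scoring code may differ, and the function name, the scaling, and the example ranks are all hypothetical.

```python
from typing import Sequence


def mean_inverse_rank(ranks: Sequence[int], scale: float = 100.0) -> float:
    """Average 1/rank over all cases, scaled to 0-100.

    ASSUMPTION: this mirrors a generic mean-reciprocal-rank metric, not
    necessarily the benchmark's exact formula.
    `ranks` holds the 1-based position the model gave the correct item
    in each case (1 = picked first).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return scale * sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for four cases, NOT real benchmark data.
example_ranks = [1, 1, 2, 4]
score = mean_inverse_rank(example_ranks)
print(f"{score:.2f}")
```

Under this assumed definition, a model that always ranks the correct item first scores 100, and every case where a broad-but-wrong candidate outranks the correct one pulls the average down sharply, since rank 2 already halves that case's contribution.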

// ANALYSIS

This looks like a real regression in constraint retention, not just a noisy eval blip.

– The benchmark is designed to punish broad matches with anti-examples, and 4.7 still gets pulled toward the wrong generalization.
– High reasoning does not close the gap to 4.6 here; the published hard-subset numbers put 4.7 behind its predecessor on the same 703-case slice.
– The no-reasoning variant dropping to 52.6 suggests the model is highly sensitive to whether it can sustain explicit deliberation.
– For developers, this is a reminder that "better model" is not a single scalar; narrow thematic inference can move opposite to coding or vision gains.

// TAGS

benchmark, reasoning, llm, research, claude-opus-4-7

DISCOVERED: 2026-04-17

PUBLISHED: 2026-04-17

RELEVANCE: 9/10

AUTHOR: zero0_one1
