SemiAnalysis benchmarks Anthropic, OpenAI on long horizon coding

// 45d ago BENCHMARK RESULT

SemiAnalysis purchased several subscription plans from both Anthropic and OpenAI to run long-horizon coding tasks at random, likely to compare the models' practical performance and reliability over extended interactions. The testing appears aimed at benchmarking how these leading models handle complex, multi-step coding scenarios under different subscription tiers.

// ANALYSIS

Benchmarking long-horizon capabilities is crucial as models are increasingly used for complex, autonomous tasks.

– Tests the practical limits of context windows and sustained reasoning.
– Compares the value propositions of different subscription plans from major AI providers.
– Helps developers understand the real-world reliability of models for coding tasks.

// TAGS

ai, llm, benchmark, coding, anthropic, openai, semianalysis

DISCOVERED

45d ago

2026-06-11

PUBLISHED

45d ago

2026-06-11

RELEVANCE

8/10

AUTHOR

pueblokc
