SemiAnalysis benchmarks Anthropic, OpenAI on long-horizon coding
SemiAnalysis purchased several subscription plans from both Anthropic and OpenAI to run long-horizon coding tasks at random, likely to compare the models' practical performance and reliability over extended interactions. The testing aims to benchmark how these models handle complex, multi-step coding scenarios across subscription tiers.
Benchmarking long-horizon capabilities is crucial as models are increasingly used for complex, autonomous tasks.
- Tests the practical limits of context windows and sustained reasoning.
- Compares the value propositions of different subscription plans from major AI providers.
- Helps developers understand the real-world reliability of models for coding tasks.
DISCOVERED
2026-06-11
PUBLISHED
2026-06-11
AUTHOR
pueblokc