Claude Opus 4.7 shows inconsistent results on MineBench
New benchmark results for Claude Opus 4.7 on MineBench reveal a regression in consistency and spatial logic compared to Opus 4.6. Despite higher scores on standard text benchmarks, the model's 3D voxel construction capabilities show a preference for scenery over structural precision, raising questions about its creative reasoning.
Claude Opus 4.7’s performance on MineBench is a reminder that state-of-the-art benchmark scores do not always translate into real-world creative utility. With an average inference time of 43 minutes and costs nearing $275 per build, the model is significantly slower and more expensive than its predecessor. Its tendency to prioritize scenery over the core build prompt suggests a shift in attention, possibly tied to its new adaptive thinking mode. While the model may be optimized for academic evaluations, it struggles with the tool-heavy, multi-step logic that complex voxel art requires.
DISCOVERED
2026-04-18
PUBLISHED
2026-04-17
AUTHOR
ENT_Alam