Goertzel posts 33% ARC-AGI-3 score
Ben Goertzel reports that a SingularityNET researcher reached a 32.58% mean human-normalized score on ARC-AGI-3 using LLMs, procedural world models, and verification. The post follows up on the interactive benchmark, which has been live since March 25, 2026, and still leaves frontier LLMs near zero without heavy scaffolding.
The interesting part here is not just the score, but the method: this is another data point that scaffolding, not raw model prompting, is what matters on interactive agent benchmarks.
- ARC-AGI-3 is no longer a static puzzle test; it rewards exploration, hypothesis revision, and long-horizon planning, so agent architecture matters as much as model quality
- The reported 32.58% puts this result in the same rough band as other public benchmark claims, which suggests the real bottleneck is search, memory, and environment modeling
- The writeup is also a warning label for benchmark hype: a clever verifier loop can move the number without proving general intelligence
- For developers, the takeaway is practical: if your system can maintain a world model and self-check its own plans, you may get farther on hard evals than by swapping in a better base LLM
- The benchmark’s value is still real because it pressures teams to build agents that adapt over time, not just answer well once
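To make the "world model plus self-checking plans" idea concrete, here is a minimal sketch of a propose-and-verify agent loop. This is a hypothetical illustration of the general pattern, not the SingularityNET researcher's actual method: the agent records observed transitions as a crude world model and only commits to action sequences that model verifies as reaching the goal.

```python
# Hypothetical sketch of a world-model agent with plan verification.
# Not the reported ARC-AGI-3 system; just the general scaffolding pattern:
# learn transitions from interaction, then search only over verified ones.

class WorldModelAgent:
    def __init__(self, actions):
        self.actions = actions
        # Learned world model: (state, action) -> observed next state.
        self.model = {}

    def update(self, state, action, next_state):
        # Record an observed transition from interacting with the environment.
        self.model[(state, action)] = next_state

    def plan(self, state, goal, depth=5):
        # Depth-limited breadth-first search over the learned model.
        # Every step of a returned plan is backed by an observed transition,
        # so the plan is "self-checked" against the world model by construction.
        frontier = [(state, [])]
        for _ in range(depth):
            next_frontier = []
            for s, path in frontier:
                for a in self.actions:
                    nxt = self.model.get((s, a))
                    if nxt is None:
                        continue  # unverified transition: never planned over
                    if nxt == goal:
                        return path + [a]
                    next_frontier.append((nxt, path + [a]))
            frontier = next_frontier
        return None  # no verified route within the search horizon
```

Usage: feed the agent a few observed transitions (e.g. `agent.update(0, "right", 1)`), then ask for `agent.plan(0, 2)`. The key design point the bullets gesture at is that improving this search, memory, and verification layer can move the score without touching the underlying base model.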
Discovered: 2h ago (2026-05-09)
Published: 4h ago (2026-05-09)
Author: marcothephoenixass
