Goertzel posts 33% ARC-AGI-3 score
Ben Goertzel reports that a SingularityNET researcher reached a 32.58% mean human-normalized score on ARC-AGI-3 using LLMs, procedural world models, and verification. The post follows up on the interactive benchmark, which has been live since March 25, 2026, and still leaves frontier LLMs near zero without heavy scaffolding.
The interesting part here is not just the score, but the method: this is another data point that scaffolding, not raw model prompting, is what matters on interactive agent benchmarks.
- ARC-AGI-3 is no longer a static puzzle test; it rewards exploration, hypothesis revision, and long-horizon planning, so agent architecture matters as much as model quality
- The reported 32.58% puts this result in the same rough band as other public benchmark claims, which suggests the real bottleneck is search, memory, and environment modeling
- The writeup is also a warning label for benchmark hype: a clever verifier loop can move the number without proving general intelligence
- For developers, the takeaway is practical: if your system can maintain a world model and self-check its own plans, you may get farther on hard evals than by swapping in a better base LLM
- The benchmark’s value is still real because it pressures teams to build agents that adapt over time, not just answer well once
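To make the "world model plus self-checking plans" idea concrete, here is a minimal sketch of a propose-and-verify agent loop. This is a hypothetical illustration of the general pattern, not the SingularityNET researcher's actual method: the agent records observed transitions as a crude world model and only commits to action sequences that model verifies as reaching the goal.

```python
# Hypothetical sketch of a world-model agent with plan verification.
# Not the reported ARC-AGI-3 system; just the general scaffolding pattern:
# learn transitions from interaction, then search only over verified ones.

class WorldModelAgent:
    def __init__(self, actions):
        self.actions = actions
        # Learned world model: (state, action) -> observed next state.
        self.model = {}

    def update(self, state, action, next_state):
        # Record an observed transition from interacting with the environment.
        self.model[(state, action)] = next_state

    def plan(self, state, goal, depth=5):
        # Depth-limited breadth-first search over the learned model.
        # Every step of a returned plan is backed by an observed transition,
        # so the plan is "self-checked" against the world model by construction.
        frontier = [(state, [])]
        for _ in range(depth):
            next_frontier = []
            for s, path in frontier:
                for a in self.actions:
                    nxt = self.model.get((s, a))
                    if nxt is None:
                        continue  # unverified transition: never planned over
                    if nxt == goal:
                        return path + [a]
                    next_frontier.append((nxt, path + [a]))
            frontier = next_frontier
        return None  # no verified route within the search horizon
```

Usage: feed the agent a few observed transitions (e.g. `agent.update(0, "right", 1)`), then ask for `agent.plan(0, 2)`. The key design point the bullets gesture at is that improving this search, memory, and verification layer can move the score without touching the underlying base model.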
Discovered: 2h ago (2026-05-09)
Published: 4h ago (2026-05-09)
Author: marcothephoenixass
