LoCoMo audit exposes broken scoring
An audit of LoCoMo finds that 6.4% of its answer key is wrong and that its LLM judge is too permissive to support tight leaderboard deltas. The post also argues that LongMemEval is closer to a context-window stress test than a true long-term memory benchmark, and that LoCoMo-Plus only partially fixes the problem by adding a new cognitive category.
This is a damning audit: if the answer key is wrong and the grader is loose, benchmark deltas stop meaning what people think they mean. The unsettling part is that the failure mode is not edge-case noise but a systematic reward for vague, on-topic answers. With 99 score-corrupting errors across 1,540 questions, the answer key alone creates a roughly 6.4% noise floor, so tiny leaderboard gains are hard to interpret. The gpt-4o-mini judge accepting 62.81% of intentionally wrong answers is a textbook case of scoring that rewards topicality over exactness. LongMemEval is useful as a retrieval-in-long-context test, but if the whole corpus fits in current context windows it is weak evidence of durable long-term memory. LoCoMo-Plus is the most interesting extension because its cue-trigger cognitive questions probe latent intent, but it inherits the broken legacy question set unchanged. The field still needs standardized ingestion, prompts, judge models, and variance reporting before cross-paper comparisons deserve much confidence.
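A back-of-envelope sketch makes the two headline numbers concrete. The figures (99 errors, 1,540 questions, 62.81% false-accept rate) come from the post; the inflation formula is an illustrative simplification that assumes the judge's false accepts are independent of the model's true correctness, which real judges need not satisfy.

```python
# Figures reported in the audit (not an official LoCoMo script).
ERRORS = 99      # score-corrupting answer-key errors found
TOTAL = 1540     # questions in the benchmark

# Answer-key noise floor: deltas smaller than this are hard to trust.
noise_floor = ERRORS / TOTAL  # ~0.064, i.e. ~6.4%

def measured_accuracy(true_acc: float, false_accept: float = 0.6281) -> float:
    """Simplified model of score inflation by a permissive judge.

    Assumes the judge always accepts truly correct answers and accepts
    wrong answers at rate `false_accept` (62.81% reported for gpt-4o-mini).
    """
    return true_acc + (1 - true_acc) * false_accept

print(f"noise floor: {noise_floor:.1%}")
print(f"true 50% accuracy reads as: {measured_accuracy(0.50):.1%}")
```

Under these assumptions a model that is right only half the time would score over 80%, which is why the post argues the judge rewards topicality rather than exactness.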
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
AUTHOR
PenfieldLabs