LoCoMo audit exposes broken scoring
REDDIT // BENCHMARK RESULT


An audit of LoCoMo says 6.4% of its answer key is wrong and its judge is too permissive to trust tight leaderboard deltas. The post also argues LongMemEval is closer to a context-window stress test than a true long-term memory benchmark, while LoCoMo-Plus only partially fixes the problem with a new cognitive category.

// ANALYSIS

This is a damning audit: if the answer key is wrong and the grader is loose, benchmark deltas stop meaning what people think they mean. The unsettling part is that the failure mode is not edge-case noise but a systematic reward for vague, on-topic answers. The claimed 99 score-corrupting errors across 1,540 questions put a roughly 6.4% noise floor under every reported score, so small leaderboard gains are hard to interpret. The gpt-4o-mini judge accepting 62.81% of intentionally wrong answers is a textbook case of scoring that rewards topicality over exactness. LongMemEval remains useful as a retrieval-in-long-context test, but if the whole corpus fits in current context windows it is not strong evidence of durable long-term memory. LoCoMo-Plus is the most interesting extension because its cue-trigger cognitive questions probe latent intent, but it inherits the legacy question set with its errors intact. The broader field still needs standardized ingestion, prompts, judge models, and variance reporting before cross-paper comparisons deserve much confidence.
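The noise-floor and permissive-judge effects above can be sketched with back-of-envelope arithmetic. This is a minimal illustration, not the audit's methodology: the 99/1,540 error count and 62.81% false-accept rate come from the post, while the assumption that a wrong answer passes the judge independently with that probability is a simplification introduced here.

```python
# Back-of-envelope check of the audit's numbers (illustrative sketch).

def key_error_rate(bad_items: int, total_items: int) -> float:
    """Fraction of the answer key that is wrong."""
    return bad_items / total_items

def inflated_score(true_accuracy: float, false_accept_rate: float) -> float:
    """Observed score under a permissive judge.

    Correct answers always pass; each wrong answer also passes with
    probability false_accept_rate (simplified independence assumption).
    """
    return true_accuracy + (1 - true_accuracy) * false_accept_rate

noise_floor = key_error_rate(99, 1540)  # ~0.064, i.e. the 6.4% from the post
print(f"answer-key noise floor: {noise_floor:.1%}")

# Two hypothetical models with a true 3-point gap look nearly identical
# under a judge that accepts 62.81% of wrong answers.
a = inflated_score(0.70, 0.6281)
b = inflated_score(0.73, 0.6281)
print(f"model A observed: {a:.3f}, model B observed: {b:.3f}, gap: {b - a:.3f}")
```

Under these assumptions the true 3-point gap compresses to about 1.1 observed points, comfortably inside the noise floor created by the corrupted key, which is why the post argues tight leaderboard deltas on this benchmark are uninterpretable.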

// TAGS
benchmark · research · llm · locomo · longmemeval · locomo-plus

DISCOVERED

19d ago

2026-03-23

PUBLISHED

19d ago

2026-03-23

RELEVANCE

8/10

AUTHOR

PenfieldLabs