LoCoMo audit exposes broken scoring
An audit of LoCoMo finds that 6.4% of its answer key is wrong and that its LLM judge is too permissive to support tight leaderboard deltas. The post also argues that LongMemEval is closer to a context-window stress test than a true long-term memory benchmark, and that LoCoMo-Plus only partially fixes the problem by adding a new cognitive category.
This is a damning audit: if the answer key is wrong and the grader is loose, benchmark deltas stop meaning what people think they mean. The unsettling part is that the failure mode is not edge-case noise but a systematic reward for vague, on-topic answers. With 99 score-corrupting errors across 1,540 questions, the answer key alone creates a roughly 6.4% noise floor, so tiny leaderboard gains are hard to interpret. The gpt-4o-mini judge accepting 62.81% of intentionally wrong answers is a textbook case of scoring that rewards topicality over exactness. LongMemEval is useful as a retrieval-in-long-context test, but if the whole corpus fits in current context windows it is weak evidence of durable long-term memory. LoCoMo-Plus is the most interesting extension because its cue-trigger cognitive questions probe latent intent, but it inherits the broken legacy question set unchanged. The field still needs standardized ingestion, prompts, judge models, and variance reporting before cross-paper comparisons deserve much confidence.
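A back-of-envelope sketch makes the two headline numbers concrete. The figures (99 errors, 1,540 questions, 62.81% false-accept rate) come from the post; the inflation formula is an illustrative simplification that assumes the judge's false accepts are independent of the model's true correctness, which real judges need not satisfy.

```python
# Figures reported in the audit (not an official LoCoMo script).
ERRORS = 99      # score-corrupting answer-key errors found
TOTAL = 1540     # questions in the benchmark

# Answer-key noise floor: deltas smaller than this are hard to trust.
noise_floor = ERRORS / TOTAL  # ~0.064, i.e. ~6.4%

def measured_accuracy(true_acc: float, false_accept: float = 0.6281) -> float:
    """Simplified model of score inflation by a permissive judge.

    Assumes the judge always accepts truly correct answers and accepts
    wrong answers at rate `false_accept` (62.81% reported for gpt-4o-mini).
    """
    return true_acc + (1 - true_acc) * false_accept

print(f"noise floor: {noise_floor:.1%}")
print(f"true 50% accuracy reads as: {measured_accuracy(0.50):.1%}")
```

Under these assumptions a model that is right only half the time would score over 80%, which is why the post argues the judge rewards topicality rather than exactness.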
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
AUTHOR
PenfieldLabs