LoCoMo scores hide eval mismatch
The Reddit post argues that AI memory systems are being compared with incompatible LoCoMo scoring setups. The official benchmark metric and the custom retrieval-style metrics many vendors cite are not measuring the same thing, so side-by-side scores are misleading.
Hot take: this is less a leaderboard problem than a methodology problem.
- LoCoMo’s ACL 2024 paper gives a clear reference point, with human performance far above model baselines, so the benchmark itself is not the issue.
- Once teams swap in custom judges, retrieval accuracy, or keyword matching, they are no longer reporting the same metric even if they cite the same dataset.
- Memory products can optimize different stages of the stack, but they need to label results honestly: retrieval quality, answer synthesis, or end-to-end memory performance.
- The only meaningful comparison is a shared protocol with fixed prompts, fixed judging rules, repeated runs, and published variance.
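To make the mismatch concrete, here is a minimal sketch of how two scoring rules can disagree on the same answer, and how repeated runs could be reported with their spread. Everything in it is an illustrative assumption rather than LoCoMo's official scorer or any vendor's harness: the tokenization, the function names `token_f1`, `keyword_match`, and `summarize_runs`, and the sample strings and scores are all hypothetical.

```python
import re
import statistics
from collections import Counter


def tokens(text):
    # Assumed normalization: lowercase, split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    # Token-overlap F1 between a predicted answer and a reference answer.
    pred, ref = tokens(prediction), tokens(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def keyword_match(prediction, reference):
    # Lenient rule: 1.0 if every reference token appears in the prediction.
    pred = set(tokens(prediction))
    ref = tokens(reference)
    return float(bool(ref) and all(tok in pred for tok in ref))


def summarize_runs(run_scores):
    # Mean and sample standard deviation across repeated runs.
    mean = statistics.mean(run_scores)
    spread = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean, spread


# Hypothetical example: a verbose answer that contains the reference.
prediction = "She moved to Paris in May 2023 for a new job"
reference = "May 2023"
print(token_f1(prediction, reference))       # 4/13, about 0.31
print(keyword_match(prediction, reference))  # 1.0

# Hypothetical per-run scores from three repeated runs of one system.
mean, spread = summarize_runs([0.61, 0.58, 0.64])
print(f"{mean:.3f} +/- {spread:.3f}")
```

The same answer scores about 0.31 under one rule and 1.0 under the other, which is why a number is only comparable when the scoring rule is reported with it. Publishing the mean together with the spread across runs also lets readers tell a real gap from run-to-run noise.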
DISCOVERED
2026-03-31
PUBLISHED
2026-03-31
AUTHOR
Efficient_Joke3384