MemPalace Benchmark Claims Get Undercut
A Reddit discussion argues that MemPalace’s launch claims of “100% on LoCoMo” and a “perfect score on LongMemEval” are misleading because the repository’s own BENCHMARKS.md documents major methodological caveats. The post says the LoCoMo result is inflated by top-k retrieval that includes the full conversation, the LongMemEval result is actually retrieval-only rather than end-to-end QA, and the remaining “perfect” hybrid mode appears to be tuned to a handful of specific failure cases. It also points out mismatches between launch marketing and the codebase around contradiction detection and “lossless compression.”
Hot take: this reads less like a breakthrough benchmark win and more like a case study in why memory evals need stricter, standardized pipelines.
- The LoCoMo “100%” appears to rely on a candidate-pool shortcut, so the retrieval step is not meaningfully discriminating.
- The LongMemEval claim is a category error if the runner only measures retrieval recall instead of the published end-to-end QA task.
- The reported perfect score seems to come from test-specific fixes, which makes it hard to treat as a general result.
- The post also highlights a gap between marketing language and the code, especially for contradiction detection and “lossless compression.”
- The strongest takeaway is not the product itself, but how easily benchmark framing can make an easier internal metric look like a field-level result.
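The candidate-pool shortcut described above can be made concrete with a small sketch. Nothing here reflects MemPalace's actual code; the function and IDs are purely illustrative. The point is that top-k "recall" stops measuring anything once the candidate pool is the full conversation and k is large enough to cover it:

```python
# Illustrative sketch (hypothetical names, not MemPalace's implementation):
# when the candidate pool always contains every conversation turn, top-k
# recall is trivially 100% regardless of ranking quality.

def topk_recall(ranked_candidates, gold_ids, k):
    """Fraction of gold evidence ids appearing in the top-k candidates."""
    topk = set(ranked_candidates[:k])
    return len(topk & set(gold_ids)) / len(gold_ids)

# Honest setup: the retriever must surface k snippets from a larger corpus,
# and an imperfect ranking can miss the gold evidence entirely.
ranked = ["t7", "t3", "t9", "t1", "t5"]  # imperfect retriever ranking
gold = ["t1", "t5"]
print(topk_recall(ranked, gold, k=3))    # 0.0 -- gold evidence missed

# Shortcut setup: the "pool" is the entire conversation and k covers it,
# so every gold id is always retrieved and recall is 1.0 by construction.
full_conversation = [f"t{i}" for i in range(1, 10)]
print(topk_recall(full_conversation, gold, k=len(full_conversation)))  # 1.0
```

Under the shortcut, a perfect score says nothing about whether retrieval discriminates relevant from irrelevant context, which is the post's core objection.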
DISCOVERED: 2026-04-07
PUBLISHED: 2026-04-07
AUTHOR: PenfieldLabs