MemPalace Benchmark Claims Get Undercut
OPEN_SOURCE
REDDIT · 4d ago · BENCHMARK RESULT

A Reddit discussion argues that MemPalace’s launch claims of “100% on LoCoMo” and a “perfect score on LongMemEval” are misleading because the repository’s own BENCHMARKS.md documents major methodological caveats. The post says the LoCoMo result is inflated by a top-k retrieval setup whose candidate pool includes the full conversation, the LongMemEval result measures retrieval-only recall rather than the published end-to-end QA task, and the remaining “perfect” hybrid mode appears to be tuned to a handful of specific failure cases. It also points out mismatches between the launch marketing and the codebase around contradiction detection and “lossless compression.”
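The candidate-pool caveat is easy to see in miniature. A minimal sketch (function and item names are illustrative, not taken from the MemPalace repository): if the pool handed to top-k retrieval always contains the gold item, for instance because the full conversation is appended as a candidate, then recall@k cannot fail whenever k covers the pool.

```python
def recall_at_k(candidates: list[str], gold_id: str, k: int) -> float:
    """Single-query recall: 1.0 if the gold item appears in the
    top-k candidates, else 0.0."""
    return 1.0 if gold_id in candidates[:k] else 0.0

# Hypothetical pool where the whole conversation is itself a candidate.
# With k >= len(pool), the retrieval step is not discriminating anything:
pool = ["chunk_a", "chunk_b", "full_conversation"]
print(recall_at_k(pool, "full_conversation", k=3))  # 1.0 regardless of ranking
```

Under that setup, a “100%” score reflects the pool construction, not retrieval quality.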

// ANALYSIS

Hot take: this reads less like a breakthrough benchmark win and more like a case study in why memory evals need stricter, standardized pipelines.

  • The LoCoMo “100%” appears to rely on a candidate-pool shortcut, so the retrieval step is not meaningfully discriminating.
  • The LongMemEval claim is a category error if the runner only measures retrieval recall instead of the published end-to-end QA task.
  • The reported perfect score seems to come from test-specific fixes, which makes it hard to treat as a general result.
  • The post also highlights a gap between marketing language and the code, especially for contradiction detection and “lossless compression.”
  • The strongest takeaway is not the product itself, but how easily benchmark framing can make an easier internal metric look like a field-level result.
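The retrieval-vs-QA category error from the second bullet can be sketched as two distinct metrics that need not agree. This is a hedged illustration with stubbed inputs, not MemPalace’s actual runner: a system can score perfectly on retrieval recall while the end-to-end answer is still wrong.

```python
def retrieval_recall(retrieved_ids: list[str], gold_id: str) -> float:
    """Retrieval-only metric: did the gold memory come back at all?"""
    return 1.0 if gold_id in retrieved_ids else 0.0

def qa_accuracy(predicted: str, gold: str) -> float:
    """End-to-end metric: did the final answer match (naive exact match)?"""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

# Retrieval surfaces the right memory, yet the (stubbed) answer step fails:
retrieved = ["mem_17", "mem_03"]
print(retrieval_recall(retrieved, "mem_17"))  # 1.0 — "retrieval-only" success
print(qa_accuracy("Paris in 2019", "Lyon"))   # 0.0 — end-to-end QA failure
```

Reporting the first number under the second benchmark’s name is the substitution the post objects to.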
// TAGS
mempalace · benchmarking · locomo · longmemeval · memory · open-source · ai · evaluation

DISCOVERED

4d ago

2026-04-07

PUBLISHED

4d ago

2026-04-07

RELEVANCE

9/10

AUTHOR

PenfieldLabs