MemPalace Benchmark Claims Get Undercut
A Reddit discussion argues that MemPalace’s launch claims of “100% on LoCoMo” and a “perfect score on LongMemEval” are misleading because the repository’s own BENCHMARKS.md documents major methodological caveats. The post says the LoCoMo result is inflated by top-k retrieval that includes the full conversation, the LongMemEval result is actually retrieval-only rather than end-to-end QA, and the remaining “perfect” hybrid mode appears to be tuned to a handful of specific failure cases. It also points out mismatches between launch marketing and the codebase around contradiction detection and “lossless compression.”
Hot take: this reads less like a breakthrough benchmark win and more like a case study in why memory evals need stricter, standardized pipelines.
- The LoCoMo “100%” appears to rely on a candidate-pool shortcut, so the retrieval step is not meaningfully discriminating.
- The LongMemEval claim is a category error if the runner only measures retrieval recall instead of the published end-to-end QA task.
- The reported perfect score seems to come from test-specific fixes, which makes it hard to treat as a general result.
- The post also highlights a gap between marketing language and the code, especially for contradiction detection and “lossless compression.”
- The strongest takeaway is not the product itself, but how easily benchmark framing can make an easier internal metric look like a field-level result.
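The candidate-pool shortcut described above can be made concrete with a small sketch. Nothing here reflects MemPalace's actual code; the function and IDs are purely illustrative. The point is that top-k "recall" stops measuring anything once the candidate pool is the full conversation and k is large enough to cover it:

```python
# Illustrative sketch (hypothetical names, not MemPalace's implementation):
# when the candidate pool always contains every conversation turn, top-k
# recall is trivially 100% regardless of ranking quality.

def topk_recall(ranked_candidates, gold_ids, k):
    """Fraction of gold evidence ids appearing in the top-k candidates."""
    topk = set(ranked_candidates[:k])
    return len(topk & set(gold_ids)) / len(gold_ids)

# Honest setup: the retriever must surface k snippets from a larger corpus,
# and an imperfect ranking can miss the gold evidence entirely.
ranked = ["t7", "t3", "t9", "t1", "t5"]  # imperfect retriever ranking
gold = ["t1", "t5"]
print(topk_recall(ranked, gold, k=3))    # 0.0 -- gold evidence missed

# Shortcut setup: the "pool" is the entire conversation and k covers it,
# so every gold id is always retrieved and recall is 1.0 by construction.
full_conversation = [f"t{i}" for i in range(1, 10)]
print(topk_recall(full_conversation, gold, k=len(full_conversation)))  # 1.0
```

Under the shortcut, a perfect score says nothing about whether retrieval discriminates relevant from irrelevant context, which is the post's core objection.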
DISCOVERED: 2026-04-07
PUBLISHED: 2026-04-07
AUTHOR: PenfieldLabs