WMB-100K exposes brittle memory systems
REDDIT // 18d ago // BENCHMARK RESULT

WMB-100K is an open-source benchmark for AI memory systems that tests retrieval across 100,000 turns. The dataset is free, a scoring run costs about $0.07, and false memories are penalized instead of ignored. When keyword matching was swapped for exact scoring, the results dropped sharply, exposing how brittle many memory stacks are at scale.

// ANALYSIS

This is the right kind of benchmark: it rewards exact retrieval and punishes confident hallucinations, which is much closer to production reality than fuzzy keyword matching. Once you score honestly, the "good enough" memory stack starts looking a lot less good.

  • Exact-turn scoring strips out the false comfort of near-matches and makes the benchmark about actual recall, not semantic vibes.
  • The 100K-turn setup is the real stress test; short benchmarks mostly measure compression tricks and prompt luck.
  • False-memory probes matter because the worst failure mode is inventing a fact the user never said.
  • The published runs show LangChain/FAISS and Mem0 collapsing on the 100K setup, which is a blunt reminder that current memory layers are still fragile.
  • The free dataset and cheap scoring make the benchmark easy to reproduce, so the community can compare systems without hand-waving.
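
To make the difference between lenient and strict scoring concrete, here is a minimal sketch in Python. It is not WMB-100K's actual scorer: the source does not spell out the scoring rules, so the `Probe` type, the `keyword_score` and `exact_score` functions, the turn numbers, and the penalty value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    """Hypothetical probe record; not the benchmark's real schema."""
    question: str
    gold_turn: Optional[int]  # None marks a false-memory probe (the user never said it)


def keyword_score(answer_text: str, gold_text: str) -> float:
    """Lenient scoring: fraction of gold words that also appear in the answer."""
    gold_words = set(gold_text.lower().split())
    answer_words = set(answer_text.lower().split())
    return len(gold_words & answer_words) / len(gold_words) if gold_words else 0.0


def exact_score(probe: Probe, predicted_turn: Optional[int], penalty: float = 1.0) -> float:
    """Strict scoring: credit only for the exact turn, a penalty for invented memories."""
    if probe.gold_turn is None:
        # False-memory probe: abstaining is correct, any retrieved turn is a fabrication.
        return 1.0 if predicted_turn is None else -penalty
    return 1.0 if predicted_turn == probe.gold_turn else 0.0


# A wrong answer that shares most words with the gold text still looks good leniently.
lenient = keyword_score("the user works at a bank in Oslo",
                        "user works at a bank in Bergen")  # 6 of 7 gold words match

probes = [
    Probe("Where does the user work?", gold_turn=41_207),
    Probe("What is the user's dog's name?", gold_turn=None),
]
predictions = [41_210, 88_013]  # a near miss, then a confident invention
total = sum(exact_score(p, t) for p, t in zip(probes, predictions))
print(round(lenient, 2), total)  # 0.86 -1.0
```

Under these assumptions the near miss earns nothing and the invented memory costs a point, while the keyword scorer would have rated a factually wrong answer at roughly 0.86. That gap is the kind of drop the summary describes.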
// TAGS
wmb-100k · benchmark · testing · llm · agent · open-source

DISCOVERED

18d ago

2026-03-25

PUBLISHED

18d ago

2026-03-25

RELEVANCE

8/10

AUTHOR

Efficient_Joke3384