OPEN_SOURCE
REDDIT // 18d ago // BENCHMARK RESULT
WMB-100K exposes brittle memory systems
WMB-100K is an open-source benchmark for AI memory systems that tests retrieval across 100,000 turns. The dataset is free, a scoring run costs about $0.07, and false memories are penalized instead of ignored. After the scoring switched from keyword matching to exact scoring, results dropped sharply, exposing how brittle many memory stacks are at real scale.
// ANALYSIS
This is the right kind of benchmark: it rewards exact retrieval and punishes confident hallucinations, which is much closer to production reality than fuzzy keyword matching. Once you score honestly, the "good enough" memory stack starts looking a lot less good.
- Exact-turn scoring strips out the false comfort of near-matches and makes the benchmark about actual recall, not semantic vibes.
- The 100K-turn setup is the real stress test; short benchmarks mostly measure compression tricks and prompt luck.
- False-memory probes matter because the worst failure mode is inventing a fact the user never said.
- The published runs show LangChain/FAISS and Mem0 collapsing on the 100K setup, which is a blunt reminder that current memory layers are still fragile.
- The free dataset and cheap scoring make the benchmark easy to reproduce, so the community can compare systems without hand-waving.
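To make the gap between lenient and strict scoring concrete, here is a minimal Python sketch. It is not WMB-100K's actual scorer: `Probe`, `keyword_score`, `exact_turn_score`, `mean_score`, the turn-index representation, and the penalty weight of 1.0 are all hypothetical, chosen only to illustrate exact-turn credit plus a false-memory penalty.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Probe:
    question: str
    # Index of the turn that holds the answer. None marks a false-memory
    # probe: the user never said the fact being asked about.
    gold_turn: Optional[int]


def keyword_score(answer_text: str, keywords: Sequence[str]) -> float:
    """Lenient scoring: full credit if any expected keyword appears."""
    text = answer_text.lower()
    return 1.0 if any(k.lower() in text for k in keywords) else 0.0


def exact_turn_score(
    probe: Probe,
    retrieved_turn: Optional[int],
    false_memory_penalty: float = 1.0,  # assumed weight, not from the benchmark
) -> float:
    """Strict scoring: credit only for the exact turn; invented memories cost points."""
    if probe.gold_turn is None:
        # The correct behaviour on a false-memory probe is to abstain (None).
        return 1.0 if retrieved_turn is None else -false_memory_penalty
    return 1.0 if retrieved_turn == probe.gold_turn else 0.0


def mean_score(
    probes: Sequence[Probe],
    retrieved: Sequence[Optional[int]],
    false_memory_penalty: float = 1.0,
) -> float:
    """Average strict score over all probes."""
    scores = [
        exact_turn_score(p, r, false_memory_penalty)
        for p, r in zip(probes, retrieved)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    probes = [
        Probe("Which city did I say I moved to?", gold_turn=41872),
        Probe("What is my dog's name?", gold_turn=907),
        Probe("What is my sister's job?", gold_turn=None),  # never mentioned
    ]
    # Hypothetical system output: one hit, one near-miss, one invented memory.
    retrieved = [41872, 912, 55310]
    print(mean_score(probes, retrieved))

    # A vague answer still earns full credit under keyword matching.
    print(keyword_score("You mentioned a dog at some point.", ["dog"]))
```

Under these assumptions, a near-miss (turn 912 instead of 907) earns nothing and the invented memory cancels out the one correct hit. A keyword matcher would credit the same system for merely mentioning the right topic.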
// TAGS
wmb-100k benchmark testing llm agent open-source
DISCOVERED
18d ago
2026-03-25
PUBLISHED
18d ago
2026-03-25
RELEVANCE
8/10
AUTHOR
Efficient_Joke3384