c137 hits 90.4% on LongMemEval-S

// BENCHMARK RESULT (45d ago)

c137 says its structured memory stack scored 90.4% on LongMemEval-S, with 98% retrieval accuracy and roughly half the token budget of comparable systems. The project is closed source, but the team is publishing a bench viewer so anyone can inspect all 500 questions, the ground truth, and the failure modes.

// ANALYSIS

The score is strong, but the bigger story is that retrieval looks close to solved on this benchmark: most of the remaining errors come from the answerer misusing context that was retrieved correctly. That makes this look less like a vector-search breakthrough and more like evidence that disciplined memory structure can beat fancier agent loops on this benchmark.
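The error split implied by the reported figures can be checked with a few lines of arithmetic. This is a rough sketch, assuming the 90.4% overall score, the 98% retrieval accuracy, and the 500-question count are exact, and assuming (the source does not state this) that every retrieval miss also produced a wrong answer. The function and key names are illustrative only.

```python
TOTAL_QUESTIONS = 500       # question count reported for LongMemEval-S
OVERALL_ACCURACY = 0.904    # reported end-to-end score
RETRIEVAL_ACCURACY = 0.98   # reported retrieval accuracy


def error_breakdown(total: int, overall_acc: float, retrieval_acc: float) -> dict:
    """Split wrong answers into retrieval misses and answerer errors."""
    wrong = round(total * (1 - overall_acc))
    retrieval_misses = round(total * (1 - retrieval_acc))
    # Assumption: each retrieval miss also led to a wrong final answer,
    # so whatever is left over is attributed to the answerer.
    answerer_errors = wrong - retrieval_misses
    return {
        "wrong": wrong,
        "retrieval_misses": retrieval_misses,
        "answerer_errors": answerer_errors,
        "answerer_share": round(answerer_errors / wrong, 2),
    }


breakdown = error_breakdown(TOTAL_QUESTIONS, OVERALL_ACCURACY, RETRIEVAL_ACCURACY)
print(breakdown)
# {'wrong': 48, 'retrieval_misses': 10, 'answerer_errors': 38, 'answerer_share': 0.79}
```

Under those assumptions, about 38 of the 48 wrong answers (roughly 79%) would sit on the answer side, which is consistent with the reading that synthesis, not lookup, is now the main ceiling.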

  • 98% retrieval accuracy is the key claim: only 10 of 500 questions lacked the needed context, so the main ceiling now appears to be answer synthesis, not memory lookup
  • The no-embeddings, 3-stage retrieve -> answer -> store pipeline is compelling for one-hop memory tasks because it keeps prompts bounded as history grows
  • The token story matters as much as the score: if c137 really delivers these results at about half the prompt budget, the architecture is more scalable than many embedding-heavy approaches
  • The closed-source caveat is real, so the public bench viewer is doing important trust work by exposing the exact question, ground truth, response, and failure bucket
  • Benchmark-wise, this is still a specialized long-term memory eval, so it says more about persistent conversational memory than about general-purpose reasoning
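
The bounded-prompt property of a retrieve -> answer -> store loop can be illustrated with a toy sketch. c137 is closed source, so this is not its implementation: the `MemoryStore` class, the keyword-overlap scoring, the `max_context` cap, and the stubbed `answer` function are all hypothetical stand-ins chosen only to show how a no-embeddings pipeline can keep the context size fixed while stored history grows.

```python
import re
from dataclasses import dataclass, field


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class MemoryStore:
    """Hypothetical structured memory: plain-text facts, no embeddings."""
    max_context: int = 3  # hard cap on facts handed to the answerer
    facts: list[str] = field(default_factory=list)

    def retrieve(self, question: str) -> list[str]:
        """Stage 1: rank facts by keyword overlap and keep the top few."""
        query = tokenize(question)
        scored = [(len(query & tokenize(fact)), idx, fact)
                  for idx, fact in enumerate(self.facts)]
        hits = [s for s in scored if s[0] > 0]
        # Higher overlap first; among ties, newer facts first.
        hits.sort(key=lambda s: (-s[0], -s[1]))
        return [fact for _, _, fact in hits[: self.max_context]]

    def store(self, fact: str) -> None:
        """Stage 3: append a new fact to memory."""
        self.facts.append(fact)


def answer(question: str, context: list[str]) -> str:
    """Stage 2: stand-in for an LLM call; returns the best-ranked fact."""
    return context[0] if context else "I don't know."


def run_turn(memory: MemoryStore, message: str) -> tuple[str, list[str]]:
    """Run one retrieve -> answer -> store cycle for a user message."""
    context = memory.retrieve(message)
    reply = answer(message, context)
    memory.store(message)
    return reply, context


memory = MemoryStore(max_context=3)
memory.store("My dog is named Biscuit")
for i in range(50):  # history grows, but the retrieved context stays capped
    memory.store(f"Note {i} about my day")

reply, context = run_turn(memory, "What is my dog named?")
print(reply)         # My dog is named Biscuit
print(len(context))  # 3
```

However many notes are stored, the answerer sees at most `max_context` facts, which is the property the bullet on bounded prompts points to. A real system would need far better ranking than raw keyword overlap.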
// TAGS
c137, benchmark, llm, search, reasoning, inference

DISCOVERED: 2026-04-26 (45d ago)

PUBLISHED: 2026-04-26 (45d ago)

RELEVANCE: 9/10

AUTHOR: MontyOW