c137 tops 90.4% on LongMemEval-S
REDDIT // 6h ago // BENCHMARK RESULT


c137 says its structured memory stack hit 90.4% on LongMemEval-S, with 98% retrieval accuracy at roughly half the token budget of comparable systems. Although the project itself is closed source, the team is also publishing a bench viewer so people can inspect all 500 questions, the ground truth answers, and the failure modes.

// ANALYSIS

Strong score, but the bigger story is that retrieval seems close to solved here: most remaining errors come from the answerer misusing context it correctly retrieved. That makes this look less like a vector-search breakthrough and more like evidence that disciplined memory structure can beat fancier agent loops on this benchmark.

  • 98% retrieval accuracy is the key claim: only 10 of 500 questions lacked the needed context, so the binding constraint now appears to be answer synthesis, not memory lookup
  • The no-embeddings, 3-stage retrieve -> answer -> store pipeline is compelling for one-hop memory tasks because it keeps prompts bounded as history grows
  • The token story matters as much as the score: if c137 really delivers these results at about half the prompt budget, the architecture is more scalable than many embedding-heavy approaches
  • The closed-source caveat is real, so the public bench viewer is doing important trust work by exposing the exact question, ground truth, response, and failure bucket
  • Benchmark-wise, this is still a specialized long-term memory eval, so it says more about persistent conversational memory than about general-purpose reasoning
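The no-embeddings retrieve -> answer -> store loop described in the bullets above can be sketched as a plain structured lookup. Everything below (the `MemoryStore` class, topic keys, the stand-in `answer` function) is a hypothetical illustration of the general pattern, not c137's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Structured memory: facts keyed by topic string; no embeddings involved.
    facts: dict = field(default_factory=dict)

    def retrieve(self, query_topics, limit=5):
        # Stage 1: exact structured lookup. The `limit` cap keeps the prompt
        # bounded no matter how long the conversation history grows.
        hits = []
        for topic in query_topics:
            hits.extend(self.facts.get(topic, []))
        return hits[:limit]

    def store(self, topic, fact):
        # Stage 3: write the new turn back into memory as a structured fact.
        self.facts.setdefault(topic, []).append(fact)

def answer(question, context):
    # Stage 2: in a real system an LLM synthesizes from the retrieved
    # context; a trivial join stands in for the answerer here.
    return f"{question} -> " + "; ".join(context)

mem = MemoryStore()
mem.store("pets", "User's cat is named Mochi")
mem.store("work", "User started a new job in March")

ctx = mem.retrieve(["pets"])
print(answer("What is the cat's name?", ctx))
```

The point of the sketch is the bounded-prompt property: retrieval cost is a keyed lookup plus a fixed cap, so the answerer's input stays the same size whether the store holds ten facts or ten thousand.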
// TAGS
c137 · benchmark · llm · search · reasoning · inference

DISCOVERED: 6h ago (2026-04-26)

PUBLISHED: 8h ago (2026-04-26)

RELEVANCE: 9/10

AUTHOR: MontyOW