Memory Sparse Attention Hits 100M Tokens
MSA is an end-to-end trainable long-term memory architecture from EverMind that aims to scale LLM context from the typical 128K–1M-token ceiling up to 100M tokens. According to the paper and model card, it combines sparse attention, document-wise RoPE, KV cache compression, and memory parallelism to keep training and inference complexity linear while preserving most of the model's performance at extreme context lengths. EverMind has also released a 4B Qwen3-based model and open-sourced the code, but the setup depends on their custom serving/inference stack rather than running on standard Transformers out of the box.
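To see why sparsity is what makes the linear-complexity claim plausible, here is a minimal, generic top-k block-sparse attention sketch. This is not EverMind's MSA algorithm (their block selection, RoPE handling, and compression are their own); it only illustrates the shared idea that each query attends to a fixed number of KV blocks, so per-token cost depends on `top_k * block`, not on total context length.

```python
import numpy as np

def block_sparse_attention(q, keys, values, block=4, top_k=2):
    """One query attends only to the top_k most relevant KV blocks.

    Illustrative only: work scales with top_k * block rather than with
    the full sequence length, which is the core of any linear-cost
    sparse-attention scheme. Not the actual MSA mechanism.
    """
    n, d = keys.shape
    n_blocks = n // block
    # Summarize each block by its mean key, then score blocks cheaply.
    summaries = keys[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    block_scores = summaries @ q
    chosen = np.argsort(block_scores)[-top_k:]
    # Gather only the tokens inside the selected blocks.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in chosen])
    k_sel, v_sel = keys[idx], values[idx]
    # Standard scaled softmax attention over the sparse subset.
    scores = k_sel @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_sel

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((32, 8))
V = rng.standard_normal((32, 8))
out = block_sparse_attention(q, K, V)
print(out.shape)  # (8,)
```

With `block=4` and `top_k=2`, each query touches 8 KV entries whether the cache holds 32 tokens or 100M; the open question MSA addresses is keeping quality high while doing so at that scale.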
This looks less like a retrofit and more like a new memory subsystem for LLMs, which is exactly why it is interesting.
- The technical claim is strong: the paper reports under 9% degradation when scaling from 16K to 100M tokens, plus 100M-token inference on 2×A800 GPUs.
- The project is credible as a research release: arXiv paper, Hugging Face model card, GitHub code, and an official blog post all line up.
- The tradeoff is real: you do not just swap this into an existing model; the architecture needs training, and their serving path is custom.
- For practical adoption, the biggest question is ecosystem friction, not just benchmark quality: integration with existing deployment stacks and model families will likely be the hard part.
- My take: if the results hold under broader workloads, this is one of the more meaningful long-context memory ideas I’ve seen recently, but it is still research-first infrastructure, not a plug-and-play product.
DISCOVERED: 2026-04-07
PUBLISHED: 2026-04-07
AUTHOR: ratbastid2000