LLM Architecture Gallery charts KV cache evolution
Sebastian Raschka's gallery turns KV cache design into a clean timeline, from GPT-2's brute-force attention to Llama 3's GQA, DeepSeek V3's latent compression, Gemma 3's sliding windows, and Mamba-style state-space models. The pattern is selective memory: newer architectures are spending less on cache while preserving long-context quality.
This is the right arc for the field: the winning architectures are getting better at selective amnesia, not perfect recall. The catch is that medium-term memory still isn't native, so most apps keep bolting memory on from the outside.
- GQA, MLA, and sliding-window attention all cut KV pressure by shrinking or sharing what gets cached, which is why long-context inference keeps getting cheaper.
- DeepSeek and Gemma are strong examples of architectures trading some direct recall for surprisingly little quality loss.
- The uncomfortable gap remains medium-term memory: RAG, prompts, files, and vector DBs are still external glue, not model-native persistence.
- Learned compaction is promising, but code benchmarks are a much cleaner target than editorial or strategic conversations where missing one detail can fail silently.
- Mamba-style SSMs change the memory equation entirely, but they also move the burden onto the model to compress state on the fly instead of revisiting stored context.
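To make the "shrinking or sharing what gets cached" point concrete, here is a rough back-of-the-envelope sketch of per-sequence cache size under each scheme. The model shape (32 layers, 32 heads, head dimension 128, 2-byte elements), the 8 KV heads, the 4096-token window, and the 576-wide latent are illustrative assumptions, not figures from the gallery or from any specific model. The sliding-window case also assumes every layer is windowed, which real hybrid designs do not necessarily do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical model shape, chosen only for illustration."""
    n_layers: int
    n_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes fp16/bf16 cache entries


def kv_bytes_mha(cfg: Config, seq_len: int) -> int:
    # Full multi-head attention: one K and one V vector per head, per layer, per token.
    return 2 * cfg.n_layers * cfg.n_heads * cfg.head_dim * seq_len * cfg.bytes_per_elem


def kv_bytes_gqa(cfg: Config, seq_len: int, n_kv_heads: int) -> int:
    # Grouped-query attention: query heads share a smaller set of K/V heads.
    return 2 * cfg.n_layers * n_kv_heads * cfg.head_dim * seq_len * cfg.bytes_per_elem


def kv_bytes_sliding(cfg: Config, seq_len: int, n_kv_heads: int, window: int) -> int:
    # Sliding window: only the most recent `window` tokens stay cached.
    # Simplification: assumes all layers use the window.
    return kv_bytes_gqa(cfg, min(seq_len, window), n_kv_heads)


def kv_bytes_latent(cfg: Config, seq_len: int, latent_dim: int) -> int:
    # Latent compression (MLA-style): one shared compressed vector per token per layer,
    # instead of separate per-head K and V vectors.
    return cfg.n_layers * latent_dim * seq_len * cfg.bytes_per_elem


def gib(n_bytes: int) -> float:
    return n_bytes / 2**30


cfg = Config(n_layers=32, n_heads=32, head_dim=128)
seq_len = 32_768

estimates = {
    "MHA": kv_bytes_mha(cfg, seq_len),
    "GQA (8 KV heads)": kv_bytes_gqa(cfg, seq_len, n_kv_heads=8),
    "Sliding window (4096)": kv_bytes_sliding(cfg, seq_len, n_kv_heads=8, window=4096),
    "Latent (dim 576)": kv_bytes_latent(cfg, seq_len, latent_dim=576),
}

for name, size in estimates.items():
    print(f"{name:>22}: {gib(size):.3f} GiB")
```

Under these assumptions, sharing 8 KV heads across 32 query heads cuts the cache by 4x, and a 4096-token window on a 32,768-token sequence cuts it by a further 8x. A state-space model would not appear in this table at all, since its state size does not grow with sequence length.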
DISCOVERED 2026-03-29
PUBLISHED 2026-03-28
AUTHOR monkey_spunk_