LLM Architecture Gallery charts KV cache evolution
OPEN_SOURCE
REDDIT // 14d ago · NEWS


Sebastian Raschka's gallery turns KV cache design into a clean timeline, from GPT-2's brute-force attention to Llama 3's GQA, DeepSeek V3's latent compression, Gemma 3's sliding windows, and Mamba-style state-space models. The pattern is selective memory: newer architectures are spending less on cache while preserving long-context quality.
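The cache savings behind that timeline are easy to make concrete with back-of-envelope arithmetic. A minimal sketch, using illustrative Llama-3-8B-like dimensions (32 layers, 32 query heads, 8 KV heads under GQA, head dim 128, fp16) rather than exact published figures:

```python
# Back-of-envelope KV cache sizing: full multi-head attention (MHA)
# vs grouped-query attention (GQA). Dimensions are illustrative
# Llama-3-8B-like values, not exact published figures.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for storing both K and V per layer, per cached token.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

LAYERS, HEAD_DIM, SEQ = 32, 128, 8192

mha = kv_cache_bytes(LAYERS, kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ)
gqa = kv_cache_bytes(LAYERS, kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ)

print(f"MHA cache: {mha / 2**30:.1f} GiB")  # 4.0 GiB at 8k context
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # 1.0 GiB at 8k context
print(f"Savings:   {mha // gqa}x")          # 4x
```

Sharing KV heads across query-head groups is the whole trick: the cache scales with KV heads, not query heads, so cutting 32 KV heads to 8 cuts cache memory 4x at identical context length.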

// ANALYSIS

This is the right arc for the field: the winning architectures are getting better at selective amnesia, not perfect recall. The catch is that medium-term memory still isn't native, so most apps keep bolting memory on from the outside.

  • GQA, MLA, and sliding-window attention all cut KV pressure by shrinking or sharing what gets cached, which is why long-context inference keeps getting cheaper.
  • DeepSeek and Gemma are strong examples of architectures trading some direct recall for surprisingly little quality loss.
  • The uncomfortable gap remains medium-term memory: RAG, prompts, files, and vector DBs are still external glue, not model-native persistence.
  • Learned compaction is promising, but code benchmarks are a far cleaner target than editorial or strategic conversations, where dropping a single detail can fail silently.
  • Mamba-style SSMs change the memory equation entirely, but they also move the burden onto the model to compress state on the fly instead of revisiting stored context.
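The contrast in that last point can be sketched with toy numbers: an attention-style KV cache appends an entry per token and grows O(n), while an SSM-style recurrence folds every token into a fixed-size state, O(1). A minimal illustration with a toy linear recurrence, not any real architecture's update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # hidden / state dimension (toy value)
TOKENS = 1000   # sequence length

# Attention-style memory: append K/V per token -> cache grows O(n).
kv_cache = []

# SSM-style memory: one fixed-size state updated per token -> O(1).
A = 0.9 * np.eye(D)                    # toy decay matrix
B = rng.standard_normal((D, D)) * 0.01
state = np.zeros(D)

for _ in range(TOKENS):
    x = rng.standard_normal(D)             # embedding of the next token
    kv_cache.append((x.copy(), x.copy()))  # stand-in for (K, V)
    state = A @ state + B @ x              # compress token into state

print("cache entries:", len(kv_cache))  # 1000 -- grows with the sequence
print("state shape:  ", state.shape)    # (16,) -- fixed, regardless of length
```

The trade is exactly the one the bullet names: the recurrence can never revisit a stored token, so anything the state fails to compress at write time is gone.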
// TAGS
llm · inference · gpu · research · open-weights · rag · llm-architecture-gallery

DISCOVERED

14d ago

2026-03-29

PUBLISHED

14d ago

2026-03-28

RELEVANCE

8 / 10

AUTHOR

monkey_spunk_