LLM Architecture Gallery charts KV cache evolution

// NEWS (105d ago)

Sebastian Raschka's gallery turns KV cache design into a clean timeline, from GPT-2's brute-force attention to Llama 3's GQA, DeepSeek V3's latent compression, Gemma 3's sliding windows, and Mamba-style state-space models. The pattern is selective memory: newer architectures spend less memory on the KV cache while preserving long-context quality.

// ANALYSIS

This is the right arc for the field: the winning architectures are getting better at selective amnesia, not perfect recall. The catch is that medium-term memory still isn't native, so most apps keep bolting memory on from the outside.

– GQA, MLA, and sliding-window attention all cut KV pressure by shrinking or sharing what gets cached, which is why long-context inference keeps getting cheaper.
– DeepSeek and Gemma are strong examples of architectures trading some direct recall for surprisingly little quality loss.
– The uncomfortable gap remains medium-term memory: RAG, prompts, files, and vector DBs are still external glue, not model-native persistence.
– Learned compaction is promising, but code benchmarks are a much cleaner target than editorial or strategic conversations where missing one detail can fail silently.
– Mamba-style SSMs change the memory equation entirely, but they also move the burden onto the model to compress state on the fly instead of revisiting stored context.
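
To put rough numbers on the "shrinking or sharing" point, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the model dimensions (32 layers, 32 heads, head size 128, fp16), the function names, the 8 KV heads, the 1024-token window, the 576-wide latent, and the SSM state size are hypothetical and are not taken from the gallery or from any specific model's config. It also simplifies: the sliding-window case treats every layer as windowed, and the latent case counts only one compressed vector per token per layer.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    # Hypothetical dimensions chosen for round numbers, not a real model.
    n_layers: int = 32
    n_heads: int = 32
    head_dim: int = 128
    bytes_per_elem: int = 2  # fp16 / bf16


def mha_bytes(cfg: CacheConfig, seq_len: int) -> int:
    """Full multi-head attention: keys and values for every head and token."""
    return 2 * cfg.n_layers * cfg.n_heads * cfg.head_dim * cfg.bytes_per_elem * seq_len


def gqa_bytes(cfg: CacheConfig, seq_len: int, n_kv_heads: int) -> int:
    """Grouped-query attention: query heads share a smaller set of KV heads."""
    return 2 * cfg.n_layers * n_kv_heads * cfg.head_dim * cfg.bytes_per_elem * seq_len


def sliding_window_bytes(cfg: CacheConfig, seq_len: int, window: int) -> int:
    """Sliding window: only the most recent `window` tokens stay cached.

    Simplification: assumes every layer is windowed.
    """
    return mha_bytes(cfg, min(seq_len, window))


def latent_bytes(cfg: CacheConfig, seq_len: int, latent_dim: int) -> int:
    """Latent compression: one compressed vector per token per layer."""
    return cfg.n_layers * latent_dim * cfg.bytes_per_elem * seq_len


def fixed_state_bytes(cfg: CacheConfig, state_elems_per_layer: int) -> int:
    """SSM-style recurrent state: size does not depend on sequence length."""
    return cfg.n_layers * state_elems_per_layer * cfg.bytes_per_elem


if __name__ == "__main__":
    cfg = CacheConfig()
    seq_len = 8192
    rows = {
        "MHA (all heads cached)": mha_bytes(cfg, seq_len),
        "GQA (8 KV heads)": gqa_bytes(cfg, seq_len, n_kv_heads=8),
        "Sliding window (1024 tokens)": sliding_window_bytes(cfg, seq_len, window=1024),
        "Latent (576-wide vector)": latent_bytes(cfg, seq_len, latent_dim=576),
        "Fixed SSM state (65536 elems/layer)": fixed_state_bytes(cfg, 65536),
    }
    for name, size in rows.items():
        print(f"{name:38s} {size / 2**30:8.3f} GiB")
```

Under these assumed dimensions, the first three rows grow linearly with context length until the window cap kicks in, while the last row stays constant no matter how long the sequence gets. That constant cost is the trade the SSM bullet describes: nothing is stored per token, so nothing can be revisited later.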

// TAGS

llm, inference, gpu, research, open-weights, rag, llm-architecture-gallery

DISCOVERED: 2026-03-29 (105d ago)

PUBLISHED: 2026-03-28 (105d ago)

RELEVANCE: 8/10

AUTHOR: monkey_spunk_
