OPEN_SOURCE
REDDIT // RESEARCH PAPER
Entropy, OLS, SVD tame KV cache spikes
HAE is a prototype KV-cache compression scheme that selects tokens by attention entropy, reconstructs discarded content with OLS (ordinary least squares), and compresses the result with SVD. The author says it cuts reconstruction error by about 3x at low memory budgets and avoids the selective error spikes seen with Top-K pruning.
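The paper itself isn't reproduced here, so the pipeline below is a minimal NumPy sketch of the three stages as the summary describes them, not the author's implementation. The entropy scoring rule (entropy of the attention mass each token receives, column-normalized) and all sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, r, keep = 64, 32, 8, 16        # tokens, head dim, SVD rank, tokens kept

K = rng.standard_normal((T, d))      # toy key cache (one head)
A = rng.random((T, T))
A /= A.sum(axis=1, keepdims=True)    # toy attention rows (each sums to 1)

# 1) Selection: score token j by the entropy of the attention it receives
#    across queries (assumed reading of "attention entropy").
col = A / A.sum(axis=0, keepdims=True)
ent = -(col * np.log(col + 1e-12)).sum(axis=0)
kept = np.sort(np.argsort(ent)[-keep:])          # keep highest-entropy tokens
drop = np.setdiff1d(np.arange(T), kept)

# 2) OLS: fit each dropped row as a linear combination of kept rows,
#    so only the coefficient matrix W needs to be stored for them.
W, *_ = np.linalg.lstsq(K[kept].T, K[drop].T, rcond=None)   # (keep, |drop|)
K_drop_hat = (K[kept].T @ W).T

# 3) SVD: rank-r compression of the kept block.
U, S, Vt = np.linalg.svd(K[kept], full_matrices=False)
K_kept_hat = U[:, :r] * S[:r] @ Vt[:r]

# Reassemble and measure overall reconstruction error.
K_hat = np.empty_like(K)
K_hat[kept], K_hat[drop] = K_kept_hat, K_drop_hat
err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative reconstruction error: {err:.3f}")
```

The memory story in this sketch: instead of `T*d` floats you store the rank-`r` factors of the kept block plus the `keep x |drop|` OLS coefficients, at the cost of the extra linear algebra per compression step.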
// ANALYSIS
Promising direction, but so far it supports only a narrow claim: better reconstruction in a synthetic setup. The hard part is whether the extra math stays worthwhile once real latency, throughput, and model-task variability are measured.
- Top-K’s failure mode here is plausible: most tokens look fine, but a few structurally important ones blow up error
- Entropy is a smarter selection signal than raw magnitude, but it may be brittle when attention is diffuse for reasons that still matter downstream
- OLS plus SVD shifts the bottleneck from memory to compute, so kernel efficiency and amortization cadence decide whether this is practical
- The memory win is interesting, but the fairness assumptions matter; “lower memory” is only useful if end-to-end serving cost also improves
- If it generalizes, this looks more like a long-context serving strategy than a simple pruning replacement
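The Top-K failure mode in the first bullet can be made concrete with a synthetic toy (the "important low-norm token" and all sizes here are constructed assumptions, not data from the paper): magnitude pruning zeroes a small-norm token outright, while OLS reconstruction from the kept tokens can recover it.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, keep = 32, 16, 24

K = rng.standard_normal((T, d))
K[5] *= 0.1                        # token 5: small norm, assumed important

# Magnitude Top-K keeps the largest-norm tokens and zeroes the rest.
norms = np.linalg.norm(K, axis=1)
kept = np.argsort(norms)[-keep:]
dropped = np.setdiff1d(np.arange(T), kept)
K_topk = K.copy()
K_topk[dropped] = 0.0

# Token 5 is dropped, so its relative error is exactly 1 (total loss) --
# a "selective error spike" even though average error looks fine.
rel = np.linalg.norm(K - K_topk, axis=1) / (norms + 1e-12)
print("token 5 dropped:", 5 in dropped)
print("relative error at token 5:", rel[5])

# HAE-style alternative: reconstruct dropped rows by OLS over kept rows.
W, *_ = np.linalg.lstsq(K[kept].T, K[dropped].T, rcond=None)
K_ols = K.copy()
K_ols[dropped] = (K[kept].T @ W).T
err_ols = np.linalg.norm(K[5] - K_ols[5]) / norms[5]
print("OLS relative error at token 5:", err_ols)
```

Note the OLS recovery is exact here only because the 24 kept rows span all of R^16; with a realistic budget (kept tokens fewer than the head dimension) the residual is nonzero, which is where the claimed ~3x error reduction would have to be demonstrated.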
// TAGS
hae · llm · inference · benchmark · research
DISCOVERED
2h ago
2026-04-19
PUBLISHED
3h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
Many_Perception_1703