OPEN_SOURCE
REDDIT // 6h ago · OPEN_SOURCE RELEASE
KIV enables 1M-token context on consumer GPUs
KIV is a middleware layer for HuggingFace transformers that replaces the standard KV cache with a tiered retrieval system to enable million-token context on limited VRAM. By exploiting the structural regularity of K vectors for indexing while offloading high-entropy V vectors to CPU RAM, it achieves near-constant GPU memory usage and functional decode speeds without modifying model weights or retraining.
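The K-on-GPU / V-on-CPU split described above can be sketched as a tiny tiered cache. This is an illustrative approximation of the idea, not KIV's actual implementation: the class name, `top_k` parameter, and plain NumPy arrays standing in for GPU and CPU tiers are all assumptions.

```python
import numpy as np

# Hypothetical sketch of the tiered-cache idea: the compact K vectors stay
# resident (on-GPU in KIV; a NumPy array here) and act as a search index,
# while the bulkier V vectors live in an offloaded CPU-side store and are
# fetched only for the few positions the current query actually attends to.
class TieredKVCache:
    def __init__(self, head_dim, top_k=8):
        self.head_dim = head_dim
        self.top_k = top_k
        self.k_index = np.empty((0, head_dim), dtype=np.float32)  # "GPU" tier
        self.v_store = []                                          # "CPU" tier

    def append(self, k, v):
        # New token: index its K vector, offload its V vector.
        self.k_index = np.vstack([self.k_index, k[None, :]])
        self.v_store.append(v)

    def attend(self, q):
        # Score every cached K against the query (cheap: K index is resident).
        scores = self.k_index @ q / np.sqrt(self.head_dim)
        idx = np.argsort(scores)[-self.top_k:]
        # Fetch only the top-k V rows from the offloaded store.
        v = np.stack([self.v_store[i] for i in idx])
        w = np.exp(scores[idx] - scores[idx].max())
        w /= w.sum()
        return w @ v
```

Because each decode step touches a fixed `top_k` number of V rows regardless of how many tokens are cached, per-step GPU memory and transfer volume stay roughly constant as context grows, which is the effect the bullets below quantify.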
// ANALYSIS
KIV democratizes long-context LLM inference by sidestepping the VRAM wall of a linearly growing KV cache through clever architectural asymmetry.
- Maintains a constant 12MB VRAM cache overhead regardless of context length, making a 1M context viable on a 12GB card.
- Achieves ~4 tokens per second at 1M context on an RTX 4070, significantly faster than traditional swapping or offloading methods.
- Pass-through implementation for HuggingFace's DynamicCache means it works out-of-the-box with popular models like Gemma, Qwen, and Phi.
- A 70/70 pass rate on needle-in-haystack tests across varied contexts validates the efficacy of the K-vector search index.
- Future optimizations focusing on CPU-to-GPU transfer could further close the latency gap for ultra-long contexts.
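The needle-in-haystack result above amounts to checking that a planted entry is recoverable from the K index alone. A toy version of that kind of check (the function name, sizes, and 4x scaling that makes the needle distinctive are illustrative, not KIV's actual harness):

```python
import numpy as np

# Plant one distinctive K vector in a long random context, then confirm
# that scoring the K index with a matching query surfaces that position.
def needle_test(context_len=10_000, head_dim=64, seed=0):
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((context_len, head_dim)).astype(np.float32)
    needle_pos = context_len // 2
    needle = rng.standard_normal(head_dim).astype(np.float32)
    keys[needle_pos] = needle * 4.0  # make the needle stand out from noise
    query = needle                   # a query that "asks for" the needle
    scores = keys @ query            # the K-index lookup
    return int(scores.argmax()) == needle_pos
```

Run across many seeds and context lengths, a check of this shape yields the pass/fail tallies the analysis cites.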
// TAGS
kiv · llm · inference · gpu · edge-ai · open-source · rag · transformers
DISCOVERED
6h ago
2026-04-12
PUBLISHED
9h ago
2026-04-12
RELEVANCE
8/10
AUTHOR
ThyGreatOof