OPEN_SOURCE
REDDIT // 6h ago · OPEN_SOURCE RELEASE
KIV enables 1M-token context on consumer GPUs
KIV is a middleware layer for HuggingFace transformers that replaces the standard KV cache with a tiered retrieval system to enable million-token context on limited VRAM. By exploiting the structural regularity of K vectors for indexing while offloading high-entropy V vectors to CPU RAM, it achieves near-constant GPU memory usage and functional decode speeds without modifying model weights or retraining.
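The K-on-GPU / V-on-CPU split described above can be sketched as a tiny tiered cache. This is an illustrative approximation of the idea, not KIV's actual implementation: the class name, `top_k` parameter, and plain NumPy arrays standing in for GPU and CPU tiers are all assumptions.

```python
import numpy as np

# Hypothetical sketch of the tiered-cache idea: the compact K vectors stay
# resident (on-GPU in KIV; a NumPy array here) and act as a search index,
# while the bulkier V vectors live in an offloaded CPU-side store and are
# fetched only for the few positions the current query actually attends to.
class TieredKVCache:
    def __init__(self, head_dim, top_k=8):
        self.head_dim = head_dim
        self.top_k = top_k
        self.k_index = np.empty((0, head_dim), dtype=np.float32)  # "GPU" tier
        self.v_store = []                                          # "CPU" tier

    def append(self, k, v):
        # New token: index its K vector, offload its V vector.
        self.k_index = np.vstack([self.k_index, k[None, :]])
        self.v_store.append(v)

    def attend(self, q):
        # Score every cached K against the query (cheap: K index is resident).
        scores = self.k_index @ q / np.sqrt(self.head_dim)
        idx = np.argsort(scores)[-self.top_k:]
        # Fetch only the top-k V rows from the offloaded store.
        v = np.stack([self.v_store[i] for i in idx])
        w = np.exp(scores[idx] - scores[idx].max())
        w /= w.sum()
        return w @ v
```

Because each decode step touches a fixed `top_k` number of V rows regardless of how many tokens are cached, per-step GPU memory and transfer volume stay roughly constant as context grows, which is the effect the bullets below quantify.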
// ANALYSIS
KIV democratizes long-context LLM inference by sidestepping the VRAM wall of a linearly growing KV cache through clever architectural asymmetry.
- Maintains a constant 12MB VRAM cache overhead regardless of context length, making a 1M context viable on a 12GB card.
- Achieves ~4 tokens per second at 1M context on an RTX 4070, significantly faster than traditional swapping or offloading methods.
- Pass-through implementation for HuggingFace's DynamicCache means it works out-of-the-box with popular models like Gemma, Qwen, and Phi.
- A 70/70 pass rate on needle-in-haystack tests across varied contexts validates the efficacy of the K-vector search index.
- Future optimizations focusing on CPU-to-GPU transfer could further close the latency gap for ultra-long contexts.
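The needle-in-haystack result above amounts to checking that a planted entry is recoverable from the K index alone. A toy version of that kind of check (the function name, sizes, and 4x scaling that makes the needle distinctive are illustrative, not KIV's actual harness):

```python
import numpy as np

# Plant one distinctive K vector in a long random context, then confirm
# that scoring the K index with a matching query surfaces that position.
def needle_test(context_len=10_000, head_dim=64, seed=0):
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((context_len, head_dim)).astype(np.float32)
    needle_pos = context_len // 2
    needle = rng.standard_normal(head_dim).astype(np.float32)
    keys[needle_pos] = needle * 4.0  # make the needle stand out from noise
    query = needle                   # a query that "asks for" the needle
    scores = keys @ query            # the K-index lookup
    return int(scores.argmax()) == needle_pos
```

Run across many seeds and context lengths, a check of this shape yields the pass/fail tallies the analysis cites.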
// TAGS
kiv · llm · inference · gpu · edge-ai · open-source · rag · transformers
DISCOVERED
6h ago
2026-04-12
PUBLISHED
9h ago
2026-04-12
RELEVANCE
8/10
AUTHOR
ThyGreatOof