Llama.cpp asymmetric KV cache halves VRAM

// 45d ago · INFRASTRUCTURE

A community evaluation found that pairing an 8-bit key cache (q8_0) with a 4-bit value cache (q4_0) in llama.cpp cuts KV cache memory roughly in half at a cost of only 1.3% in precision. Developers are pushing to add this asymmetric combination to the default CUDA builds so that it no longer triggers a slow CPU fallback during prompt processing.
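To put the memory claim in context, here is a back-of-the-envelope sketch of KV cache size by cache type. The block sizes (34 bytes per 32 values for q8_0, 18 bytes for q4_0) follow ggml's block formats. The model shape (32 layers, 8 KV heads, head dimension 128, 32K context) is an illustrative assumption, not a figure from the evaluation, and `kv_cache_bytes` is a hypothetical helper name.

```python
# Bytes needed to store 32 cache elements in each format.
# q8_0: 32 int8 values + one fp16 scale; q4_0: 32 packed 4-bit values + one fp16 scale.
# f16 is not block-quantized; 64 is simply 32 elements * 2 bytes.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Estimate total KV cache size in bytes for the given K and V cache types."""
    # Number of elements in the K cache (the V cache holds the same number).
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    blocks = elems // BLOCK_SIZE
    return blocks * (BLOCK_BYTES[type_k] + BLOCK_BYTES[type_v])


# Assumed model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

baseline = kv_cache_bytes(**shape, type_k="f16", type_v="f16")
mixed = kv_cache_bytes(**shape, type_k="q8_0", type_v="q4_0")

print(f"f16/f16:   {baseline / 2**30:.3f} GiB")
print(f"q8_0/q4_0: {mixed / 2**30:.3f} GiB ({mixed / baseline:.1%} of f16)")
```

Under these assumed dimensions the mixed cache comes to about 41% of the f16 cache (roughly 1.6 GiB versus 4 GiB). The source does not state which baseline the evaluation used, so treat this as a sanity check rather than a reproduction of its numbers.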

// ANALYSIS

This is a significant efficiency gain for developers trying to fit large-context models onto consumer GPUs.

– High-precision keys (q8_0) preserve attention accuracy, while values tolerate heavy 4-bit quantization (q4_0)
– Running with `-ctk q8_0 -ctv q4_0` currently triggers a slow CPU fallback unless llama.cpp is compiled manually with the exhaustive `FA_ALL_QUANTS` flag
– Adding this specific combination to the default builds would keep prompt processing on the GPU out of the box
– Asymmetric KV quantization is rapidly becoming the standard trick for maximizing context length on local hardware
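The gap between 8-bit and 4-bit rounding error can be illustrated with a simplified block quantizer. This is only a sketch: it uses a symmetric integer grid with one shared scale per 32-value block, which is close in spirit to q8_0 but simpler than ggml's actual q4_0 layout, and it runs on synthetic Gaussian data rather than real key or value tensors. It shows how much coarser a 4-bit grid is. It does not by itself show why keys are more sensitive than values.

```python
import random

BLOCK = 32  # values that share one scale, as in ggml's block formats


def roundtrip(values, levels):
    """Quantize each block to integers in [-levels, levels], then dequantize."""
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / levels if amax > 0 else 1.0
        for v in block:
            q = max(-levels, min(levels, round(v / scale)))
            out.append(q * scale)
    return out


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in data

err8 = rms_error(data, roundtrip(data, 127))  # 8-bit symmetric grid
err4 = rms_error(data, roundtrip(data, 7))    # simplified 4-bit symmetric grid

print(f"8-bit RMS error: {err8:.5f}")
print(f"4-bit RMS error: {err4:.5f}")
```

The 4-bit round trip loses noticeably more precision per element, which is why spending the extra bits on only one of the two caches is an attractive trade.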

// TAGS

llama.cpp, inference, quantization, open-source, local-first, long-context, llm

DISCOVERED

45d ago

2026-05-22

PUBLISHED

45d ago

2026-05-22

RELEVANCE

8/10

AUTHOR

Ueberlord
