OPEN_SOURCE
REDDIT · 12d ago · BENCHMARK RESULT
llama.cpp KV cache quantization backfires on DGX Spark
On NVIDIA DGX Spark, llama.cpp's q4_0 KV cache mode performs worse than f16 in the reported long-context benchmark, and even uses more memory at 64K tokens. The only quantized setting that still looks practical here is q8_0, which keeps most of the memory savings without the same runaway overhead.
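As a sanity check on why the q4_0 memory result is surprising, here is the raw per-element byte math for ggml's cache block formats. The block layouts below are the standard ggml definitions (q8_0: 32 int8 values plus one f16 scale; q4_0: 32 four-bit values plus one f16 scale); the comparison itself is my arithmetic, not a figure from the benchmark. On paper q4_0 should be the smallest format, so memory growth at 64K points at implementation overhead rather than the block format itself:

```python
# Bytes per element implied by ggml's KV cache type block layouts.
# f16:  2 bytes per element, no metadata.
# q8_0: blocks of 32 int8 values + one f16 scale -> 34 bytes per 32 elements.
# q4_0: blocks of 32 4-bit values (16 bytes) + one f16 scale -> 18 bytes per 32 elements.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # ~1.06 bytes/elem
    "q4_0": 18 / 32,  # ~0.56 bytes/elem
}

for name, bpe in BYTES_PER_ELEM.items():
    print(f"{name}: {bpe:.4f} bytes/elem ({bpe / 2.0:.0%} of f16)")
```

By this accounting q4_0 stores roughly 28% of f16's bytes, which is exactly why the reported memory increase reads as a pathology in the cache implementation, not in the quantization scheme.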
// ANALYSIS
This is a hardware-specific reminder that compression only helps when memory pressure is the real bottleneck. On a 128GB unified-memory system, q4_0 turns into a false economy: the metadata and dequantization cost can outweigh the bytes saved.
- The 64K result is the standout: prompt throughput falls from 282.7 tok/s to 21.3 tok/s, which looks like a pathological implementation or kernel-path issue, not just ordinary quantization overhead.
- q8_0 is the sane middle ground here: it roughly halves KV cache size without the dramatic slowdown, so it preserves the benefit that actually matters on Spark.
- The benchmark supports a broader point about local inference stacks: software quantization schemes are not automatically good on modern unified-memory hardware.
- For Blackwell-class systems, the more interesting path is hardware-aware or zero-overhead approaches like NVFP4 or TurboQuant, not legacy cache formats that still depend on software dequant loops.
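The "roughly halves KV cache size" claim can be sanity-checked with back-of-the-envelope math. The model dimensions below (32 layers, 8 KV heads, head dim 128) are illustrative assumptions, not the benchmarked model's; the per-element byte counts come from ggml's standard f16/q8_0/q4_0 block layouts:

```python
# Rough KV cache size at 64K context for a hypothetical model.
# Dimensions are assumptions for illustration, not the benchmarked model's.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_layers * n_tokens * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32), ("q4_0", 18 / 32)]:
    size = kv_cache_bytes(65536, 32, 8, 128, bpe)
    print(f"{name}: {size / GIB:.2f} GiB")
# prints: f16 8.00 GiB, q8_0 4.25 GiB, q4_0 2.25 GiB
```

So q8_0 lands at ~53% of f16 (the scale metadata costs a little over the ideal 50%), which matches the "roughly halves" framing; in llama.cpp these modes are selected with the `--cache-type-k` / `--cache-type-v` flags.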
// TAGS
llama-cpp · benchmark · inference · gpu · llm
DISCOVERED
2026-03-31
PUBLISHED
2026-03-31
RELEVANCE
8/10
AUTHOR
dentity9000