OPEN_SOURCE
REDDIT · 12d ago · BENCHMARK RESULT
llama.cpp KV cache quantization backfires on DGX Spark
On NVIDIA DGX Spark, llama.cpp's q4_0 KV cache mode performs worse than f16 in the reported long-context benchmark, and even uses more memory at 64K tokens. The only quantized setting that still looks practical here is q8_0, which keeps most of the memory savings without the same runaway overhead.
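As a sanity check on why the q4_0 memory result is surprising, here is the raw per-element byte math for ggml's cache block formats. The block layouts below are the standard ggml definitions (q8_0: 32 int8 values plus one f16 scale; q4_0: 32 four-bit values plus one f16 scale); the comparison itself is my arithmetic, not a figure from the benchmark. On paper q4_0 should be the smallest format, so memory growth at 64K points at implementation overhead rather than the block format itself:

```python
# Bytes per element implied by ggml's KV cache type block layouts.
# f16:  2 bytes per element, no metadata.
# q8_0: blocks of 32 int8 values + one f16 scale -> 34 bytes per 32 elements.
# q4_0: blocks of 32 4-bit values (16 bytes) + one f16 scale -> 18 bytes per 32 elements.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # ~1.06 bytes/elem
    "q4_0": 18 / 32,  # ~0.56 bytes/elem
}

for name, bpe in BYTES_PER_ELEM.items():
    print(f"{name}: {bpe:.4f} bytes/elem ({bpe / 2.0:.0%} of f16)")
```

By this accounting q4_0 stores roughly 28% of f16's bytes, which is exactly why the reported memory increase reads as a pathology in the cache implementation, not in the quantization scheme.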
// ANALYSIS
This is a hardware-specific reminder that compression only helps when memory pressure is the real bottleneck. On a 128GB unified-memory system, q4_0 turns into a false economy: the metadata and dequantization cost can outweigh the bytes saved.
- The 64K result is the standout: prompt throughput falls from 282.7 tok/s to 21.3 tok/s, which looks like a pathological implementation or kernel-path issue, not just ordinary quantization overhead.
- q8_0 is the sane middle ground here: it roughly halves KV cache size without the dramatic slowdown, so it preserves the benefit that actually matters on Spark.
- The benchmark supports a broader point about local inference stacks: software quantization schemes are not automatically good on modern unified-memory hardware.
- For Blackwell-class systems, the more interesting path is hardware-aware or zero-overhead approaches like NVFP4 or TurboQuant, not legacy cache formats that still depend on software dequant loops.
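The "roughly halves KV cache size" claim can be sanity-checked with back-of-the-envelope math. The model dimensions below (32 layers, 8 KV heads, head dim 128) are illustrative assumptions, not the benchmarked model's; the per-element byte counts come from ggml's standard f16/q8_0/q4_0 block layouts:

```python
# Rough KV cache size at 64K context for a hypothetical model.
# Dimensions are assumptions for illustration, not the benchmarked model's.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_layers * n_tokens * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32), ("q4_0", 18 / 32)]:
    size = kv_cache_bytes(65536, 32, 8, 128, bpe)
    print(f"{name}: {size / GIB:.2f} GiB")
# prints: f16 8.00 GiB, q8_0 4.25 GiB, q4_0 2.25 GiB
```

So q8_0 lands at ~53% of f16 (the scale metadata costs a little over the ideal 50%), which matches the "roughly halves" framing; in llama.cpp these modes are selected with the `--cache-type-k` / `--cache-type-v` flags.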
// TAGS
llama-cpp · benchmark · inference · gpu · llm
DISCOVERED
2026-03-31
PUBLISHED
2026-03-31
RELEVANCE
8/10
AUTHOR
dentity9000