llama.cpp KV cache quantization backfires on DGX Spark
OPEN_SOURCE
REDDIT // 12d ago · BENCHMARK RESULT


On NVIDIA DGX Spark, llama.cpp's q4_0 KV cache mode performs worse than f16 in the reported long-context benchmark, and even uses more memory at 64K tokens. The only quantized setting that still looks practical here is q8_0, which keeps most of the memory savings without the same runaway overhead.

// ANALYSIS

This is a hardware-specific reminder that compression only helps when memory pressure is the real bottleneck. On a 128GB unified-memory system, q4_0 turns into a false economy: the metadata and dequantization cost can outweigh the bytes saved.
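The "bytes saved" side of that trade-off is easy to pin down, because GGML's packed cache formats have fixed block layouts: q8_0 stores 32 int8 values plus a 2-byte fp16 scale per block, and q4_0 packs 32 4-bit values into 16 bytes plus a 2-byte scale. A minimal sketch of the nominal per-element cost (the block sizes are from ggml's format definitions; the percentages only describe storage, not runtime behavior):

```python
# Nominal bytes per element for llama.cpp/GGML KV cache types.
# q8_0 block: 2-byte fp16 scale + 32 int8 values  = 34 bytes / 32 elems
# q4_0 block: 2-byte fp16 scale + 16 packed bytes = 18 bytes / 32 elems
CACHE_TYPES = {
    "f16":  2.0,      # unblocked, 2 bytes per element
    "q8_0": 34 / 32,  # 1.0625 bytes/elem -> ~47% smaller than f16
    "q4_0": 18 / 32,  # 0.5625 bytes/elem -> ~72% smaller on paper
}

for name, bpe in CACHE_TYPES.items():
    print(f"{name}: {bpe:.4f} bytes/elem, {1 - bpe / 2.0:.0%} smaller than f16")
```

On paper q4_0 should halve q8_0's cache again, which is exactly why the benchmark result is surprising: the per-block scales and the software dequant loops they force erase the theoretical advantage at long context.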

  • The 64K result is the standout: prompt throughput falls from 282.7 tok/s to 21.3 tok/s, which looks like a pathological implementation or kernel-path issue, not just ordinary quantization overhead.
  • q8_0 is the sane middle ground here: it roughly halves KV cache size without the dramatic slowdown, so it preserves the benefit that actually matters on Spark.
  • The benchmark supports a broader point about local inference stacks: software quantization schemes are not automatically good on modern unified-memory hardware.
  • For Blackwell-class systems, the more interesting path is hardware-aware or zero-overhead approaches like NVFP4 or TurboQuant, not legacy cache formats that still depend on software dequant loops.
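To make the "roughly halves" claim for q8_0 concrete, here is a back-of-the-envelope cache-size estimate. The model config below (32 layers, 8 KV heads, head dim 128) is a hypothetical 8B-class shape, not the model from the benchmark; real sizes depend on the architecture:

```python
# Rough KV cache size at a given context length. The layer/head/dim numbers
# are an illustrative 8B-class config (assumption, not from the benchmark).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2.0):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem

GIB = 1024 ** 3
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32), ("q4_0", 18 / 32)]:
    size = kv_cache_bytes(65536, bytes_per_elem=bpe)
    print(f"{name}: {size / GIB:.2f} GiB at 64K context")
```

With this config, f16 comes out to 8.00 GiB at 64K context and q8_0 to 4.25 GiB, which is the savings that still matters on a 128GB Spark. Note these are nominal figures only: the reported benchmark found q4_0 using *more* memory than f16 at 64K, so whatever metadata and workspace the q4_0 path allocates is not captured by this arithmetic.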
// TAGS
llama-cpp · benchmark · inference · gpu · llm

DISCOVERED

2026-03-31 (12d ago)

PUBLISHED

2026-03-31 (12d ago)

RELEVANCE

8/10

AUTHOR

dentity9000