OPEN_SOURCE
REDDIT // 2h ago · BENCHMARK RESULT

llama.cpp Q8 KV slows long context

A LocalLLaMA user reports that switching llama.cpp KV cache from FP16 to Q8 made Qwen 3.5 122B much slower on a MacBook M2 Max at long context, with tok/s appearing to halve around 60k tokens. The post is an anecdotal benchmark, but it highlights a real tradeoff in local long-context inference: memory savings can expose backend-specific quantization overhead.

// ANALYSIS

Q8 KV cache is often pitched as the conservative memory-saving option, so a large Apple Silicon slowdown is exactly the kind of edge case local inference users should measure instead of assuming. This is less a Qwen-only story than a reminder that KV quantization performance depends heavily on model architecture, context length, Metal kernels, flash attention behavior, and cache type combinations.

  • Q8 KV reduces cache memory versus FP16, but decode speed can suffer if quantized cache reads require extra dequantization or unfused attention paths.
  • Long context magnifies the cost because attention repeatedly scans a much larger KV history during generation.
  • Qwen 3.5 users have also been reporting sensitivity around BF16/FP16/Q8 cache choices, so correctness and speed need to be tested together.
  • For Mac users, the practical tuning loop is still empirical: compare FP16, BF16, and Q8_0 cache types, along with batch size, flash attention, and the latest llama.cpp build, on the exact model and context target.
  • The signal is useful, but the post needs reproducible commands, build info, and timing tables before it should be treated as a general benchmark.
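To ground the memory-versus-speed tradeoff above, a back-of-envelope sketch of KV cache size at ~60k context. The architecture numbers are assumed placeholders for a large grouped-query model (the post does not give the real config), and the Q8_0 byte cost follows llama.cpp's block format of 32 int8 values plus one FP16 scale:

```python
# ASSUMED placeholder dimensions for a large GQA transformer; not the
# actual "Qwen 3.5 122B" config, which the post does not specify.
N_LAYERS = 80        # assumed layer count
N_KV_HEADS = 8       # assumed grouped-query KV heads
HEAD_DIM = 128       # assumed per-head dimension
N_CTX = 60_000       # context length from the post

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # K and V: 2 tensors per layer, each n_ctx * n_kv_heads * head_dim elems
    return 2 * N_LAYERS * N_CTX * N_KV_HEADS * HEAD_DIM * bytes_per_elem

fp16_gib = kv_cache_bytes(2.0) / 2**30
# Q8_0 block: 32 int8 values + one FP16 scale = 34 bytes / 32 elems
q8_gib = kv_cache_bytes(34 / 32) / 2**30

print(f"FP16 KV cache: {fp16_gib:.1f} GiB")  # ~18.3 GiB
print(f"Q8_0 KV cache: {q8_gib:.1f} GiB")    # ~9.7 GiB
```

The cache nearly halves, but every decode step still streams the entire KV history through attention, so any per-block dequantization or unfused Metal path pays that cost once per generated token, which is consistent with the slowdown appearing only at long context.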
// TAGS
llama-cpp · qwen · llm · inference · edge-ai · self-hosted · benchmark

DISCOVERED

2h ago

2026-04-22

PUBLISHED

4h ago

2026-04-22

RELEVANCE

6 / 10

AUTHOR

No_Algae1753