llama.cpp Q8 KV slows long context
A LocalLLaMA user reports that switching the llama.cpp KV cache from FP16 to Q8 made Qwen 3.5 122B much slower on a MacBook M2 Max at long context, with tok/s appearing to halve around 60k tokens. The post is an anecdotal benchmark, but it highlights a real tradeoff in local long-context inference: memory savings can expose backend-specific quantization overhead.
Q8 KV cache is often pitched as the conservative memory-saving option, so a large Apple Silicon slowdown is exactly the kind of edge case local inference users should measure instead of assuming. This is less a Qwen-only story than a reminder that KV quantization performance depends heavily on model architecture, context length, Metal kernels, flash attention behavior, and cache type combinations.
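To see why Q8 KV is pitched as the memory-saving option in the first place, here is a rough back-of-envelope sketch of KV cache size at the ~60k-token context from the post. The layer and head counts below are illustrative placeholders, not the actual Qwen 3.5 122B configuration; the Q8_0 size uses llama.cpp's block layout of 32 int8 values plus one fp16 scale.

```python
# Rough KV-cache size estimate: FP16 vs Q8_0 at long context.
# n_layers / n_kv_heads / head_dim below are hypothetical placeholders,
# NOT the real Qwen 3.5 122B config.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

FP16 = 2.0
Q8_0 = 34 / 32  # llama.cpp Q8_0 block: 32 int8 values + one fp16 scale

ctx = 60_000  # context depth reported in the post
fp16_gib = kv_cache_bytes(ctx, 60, 8, 128, FP16) / 2**30
q8_gib = kv_cache_bytes(ctx, 60, 8, 128, Q8_0) / 2**30
print(f"FP16: {fp16_gib:.1f} GiB, Q8_0: {q8_gib:.1f} GiB")
```

The savings ratio is fixed at 2 / (34/32) ≈ 1.88x regardless of model size, which is why a large decode-speed regression can wipe out the appeal of the smaller cache: the attention kernels still have to read (and dequantize) the entire history every generated token.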
- Q8 KV reduces cache memory versus FP16, but decode speed can suffer if quantized cache reads require extra dequantization or unfused attention paths.
- Long context magnifies the cost because attention repeatedly scans a much larger KV history during generation.
- Qwen 3.5 users have also been reporting sensitivity around BF16/FP16/Q8 cache choices, so correctness and speed need to be tested together.
- For Mac users, the practical tuning loop is still empirical: compare cache types (FP16, BF16, Q8_0), batch sizes, flash attention on/off, and the latest llama.cpp builds on the exact model and context target.
- The signal is useful, but the post needs reproducible commands, build info, and timing tables before it should be treated as a general benchmark.
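The empirical tuning loop above can be scripted with llama.cpp's own llama-bench tool. A minimal sketch follows; the model path is a placeholder, flag names reflect recent llama.cpp builds (verify against your build's `--help`), and note that a quantized V cache requires flash attention.

```shell
#!/bin/sh
# Compare FP16 vs Q8_0 KV cache on the same build at a deep context.
# model.gguf is a placeholder path. -d pre-fills the KV cache to the
# given depth before timing, approximating the ~60k-token regime.
MODEL=model.gguf

# Baseline: FP16 K and V cache, flash attention on.
./llama-bench -m "$MODEL" -fa 1 -d 60000 -n 32 -ctk f16 -ctv f16

# Candidate: Q8_0 K and V cache (quantized V needs flash attention).
./llama-bench -m "$MODEL" -fa 1 -d 60000 -n 32 -ctk q8_0 -ctv q8_0
```

Running both on the exact model, build, and context target, and keeping the resulting tok/s tables, is precisely the reproducibility the post currently lacks.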
DISCOVERED 2026-04-22 · PUBLISHED 2026-04-22
AUTHOR No_Algae1753