llama.cpp Q8 KV slows long context
A LocalLLaMA user reports that switching the llama.cpp KV cache from FP16 to Q8 made Qwen 3.5 122B much slower on a MacBook M2 Max at long context, with tok/s appearing to halve around 60k tokens. The post is an anecdotal benchmark, but it highlights a real tradeoff in local long-context inference: memory savings can expose backend-specific quantization overhead.
Q8 KV cache is often pitched as the conservative memory-saving option, so a large Apple Silicon slowdown is exactly the kind of edge case local inference users should measure instead of assuming. This is less a Qwen-only story than a reminder that KV quantization performance depends heavily on model architecture, context length, Metal kernels, flash attention behavior, and cache type combinations.
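To see why Q8 KV is pitched as the memory-saving option in the first place, here is a rough back-of-envelope sketch of KV cache size at the ~60k-token context from the post. The layer and head counts below are illustrative placeholders, not the actual Qwen 3.5 122B configuration; the Q8_0 size uses llama.cpp's block layout of 32 int8 values plus one fp16 scale.

```python
# Rough KV-cache size estimate: FP16 vs Q8_0 at long context.
# n_layers / n_kv_heads / head_dim below are hypothetical placeholders,
# NOT the real Qwen 3.5 122B config.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

FP16 = 2.0
Q8_0 = 34 / 32  # llama.cpp Q8_0 block: 32 int8 values + one fp16 scale

ctx = 60_000  # context depth reported in the post
fp16_gib = kv_cache_bytes(ctx, 60, 8, 128, FP16) / 2**30
q8_gib = kv_cache_bytes(ctx, 60, 8, 128, Q8_0) / 2**30
print(f"FP16: {fp16_gib:.1f} GiB, Q8_0: {q8_gib:.1f} GiB")
```

The savings ratio is fixed at 2 / (34/32) ≈ 1.88x regardless of model size, which is why a large decode-speed regression can wipe out the appeal of the smaller cache: the attention kernels still have to read (and dequantize) the entire history every generated token.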
- Q8 KV reduces cache memory versus FP16, but decode speed can suffer if quantized cache reads require extra dequantization or unfused attention paths.
- Long context magnifies the cost because attention repeatedly scans a much larger KV history during generation.
- Qwen 3.5 users have also been reporting sensitivity around BF16/FP16/Q8 cache choices, so correctness and speed need to be tested together.
- For Mac users, the practical tuning loop is still empirical: compare cache types (FP16, BF16, Q8_0), batch sizes, flash attention on/off, and the latest llama.cpp builds on the exact model and context target.
- The signal is useful, but the post needs reproducible commands, build info, and timing tables before it should be treated as a general benchmark.
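The empirical tuning loop above can be scripted with llama.cpp's own llama-bench tool. A minimal sketch follows; the model path is a placeholder, flag names reflect recent llama.cpp builds (verify against your build's `--help`), and note that a quantized V cache requires flash attention.

```shell
#!/bin/sh
# Compare FP16 vs Q8_0 KV cache on the same build at a deep context.
# model.gguf is a placeholder path. -d pre-fills the KV cache to the
# given depth before timing, approximating the ~60k-token regime.
MODEL=model.gguf

# Baseline: FP16 K and V cache, flash attention on.
./llama-bench -m "$MODEL" -fa 1 -d 60000 -n 32 -ctk f16 -ctv f16

# Candidate: Q8_0 K and V cache (quantized V needs flash attention).
./llama-bench -m "$MODEL" -fa 1 -d 60000 -n 32 -ctk q8_0 -ctv q8_0
```

Running both on the exact model, build, and context target, and keeping the resulting tok/s tables, is precisely the reproducibility the post currently lacks.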
DISCOVERED 2026-04-22 · PUBLISHED 2026-04-22
AUTHOR No_Algae1753