llama-cpp-turboquant Benchmarks 1M Context on M5 Max
REDDIT · BENCHMARK RESULT · OPEN_SOURCE


This benchmark sweeps Qwen 3.6-35B-A3B across f16, q8_0, turbo3, and turbo4 KV cache types on an M5 Max, showing how the winner changes as context grows from 0 to 1M tokens. The main takeaway is that long-context behavior is workload-dependent: turbo3 holds the best ceiling for maximum context, while turbo4 can pull ahead on decode in some very long runs.
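
For reproduction, the sweep reduces to a loop over cache types and prompt lengths. Below is a minimal sketch in Python, assuming the llama-cpp-turboquant fork builds the upstream llama-bench tool and registers turbo3/turbo4 as ordinary -ctk/-ctv cache types; the binary path, model filename, and exact context steps are placeholders, not taken from the post.

```python
#!/usr/bin/env python3
"""Sweep KV cache types against growing context with llama-bench.
Assumptions: llama-cpp-turboquant builds the standard llama-bench
binary and exposes turbo3/turbo4 through the usual -ctk/-ctv flags;
paths and context steps below are placeholders."""
import subprocess

LLAMA_BENCH = "./build/bin/llama-bench"        # assumed build location
MODEL = "qwen3.6-35b-a3b-q4_k_m.gguf"          # placeholder filename
KV_TYPES = ["f16", "q8_0", "turbo4", "turbo3"]
CONTEXTS = [0, 32_768, 131_072, 262_144, 524_288, 1_048_576]

for kv in KV_TYPES:
    for n_ctx in CONTEXTS:
        cmd = [LLAMA_BENCH, "-m", MODEL, "-ctk", kv, "-ctv", kv, "-o", "csv"]
        if n_ctx == 0:
            cmd += ["-n", "128"]               # pure decode, empty cache
        else:
            cmd += ["-p", str(n_ctx),          # prefill at this length
                    "-pg", f"{n_ctx},128"]     # 128-token decode after it
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Wider cache types are expected to fail (OOM) near 1M tokens.
            print(f"{kv} @ {n_ctx}: FAILED")
        else:
            print(result.stdout.strip())
```

The -p and -pg tests separate the two phases the analysis below cares about: prefill throughput at a given context, and decode throughput after that context has been filled.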

// ANALYSIS

The interesting part is not that KV quantization helps, but that it helps in different ways depending on whether you are paying for prefill or for decode.

  • At short context, f16 still leads or stays essentially tied, so there is no free lunch for small prompts.
  • Around 128K, turbo3 catches q8_0 on prefill, which is a strong sign the workload is becoming bandwidth-bound.
  • turbo3 and turbo4 split cleanly by phase: turbo3 is better on prefill at 256K, while turbo4 wins decode and widens that lead at 512K.
  • Only turbo3 survives the full 1M-token run, so if the goal is absolute context ceiling on this machine, it is the practical choice (see the memory sketch after this list).
  • The results are hardware-specific and cover symmetric K/V types only, so the crossover points are a map for Apple Silicon rather than a universal rule.
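
The memory side of that ceiling claim is easy to sanity-check. A back-of-envelope sketch follows, with the caveat that the Qwen 3.6-35B-A3B geometry is not given in the post (layer count, KV heads, and head dimension below are placeholder values) and the turbo3/turbo4 storage costs are guesses of roughly 3.5 and 4.5 bits per element including block scales.

```python
"""Back-of-envelope KV cache sizes behind the 1M-token ceiling.
Model geometry and per-type bit widths are assumptions, not
values from the post."""

GiB = 1024 ** 3

# Assumed model geometry (placeholders).
N_LAYERS = 48
N_KV_HEADS = 4          # grouped-query attention
HEAD_DIM = 128

# Effective bits per stored element, including quantization scales.
# turbo3/turbo4 costs are guesses; f16 and q8_0 follow llama.cpp norms.
BITS = {"f16": 16.0, "q8_0": 8.5, "turbo4": 4.5, "turbo3": 3.5}

def kv_cache_gib(n_tokens: int, bits: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer per token.
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * n_tokens
    return elems * bits / 8 / GiB

for n_tokens in (131_072, 262_144, 524_288, 1_048_576):
    row = ", ".join(f"{t}={kv_cache_gib(n_tokens, b):6.1f} GiB"
                    for t, b in BITS.items())
    print(f"{n_tokens:>9} tok: {row}")
```

Under these assumptions, the f16 cache alone reaches about 96 GiB at 1M tokens while a ~3.5-bit type stays near 21 GiB, which is consistent with only the narrowest type surviving the full run on a fixed-memory machine.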
// TAGS
benchmark · inference · llm · open-source · gpu · llama-cpp-turboquant

DISCOVERED: 3h ago (2026-04-28)

PUBLISHED: 5h ago (2026-04-28)

RELEVANCE: 8/10

AUTHOR: Defilan