llama-cpp-turboquant Benchmarks 1M Context on M5 Max
REDDIT · BENCHMARK RESULT · OPEN_SOURCE


This benchmark sweeps Qwen 3.6-35B-A3B across f16, q8_0, turbo3, and turbo4 KV cache types on an M5 Max, showing how the winner changes as context grows from 0 to 1M tokens. The main takeaway is that long-context behavior is workload-dependent: turbo3 holds the best ceiling for maximum context, while turbo4 can pull ahead on decode in some very long runs.
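
For reproduction, the sweep reduces to a loop over cache types and prompt lengths. Below is a minimal sketch in Python, assuming the llama-cpp-turboquant fork builds the upstream llama-bench tool and registers turbo3/turbo4 as ordinary -ctk/-ctv cache types; the binary path, model filename, and exact context steps are placeholders, not taken from the post.

```python
#!/usr/bin/env python3
"""Sweep KV cache types against growing context with llama-bench.
Assumptions: llama-cpp-turboquant builds the standard llama-bench
binary and exposes turbo3/turbo4 through the usual -ctk/-ctv flags;
paths and context steps below are placeholders."""
import subprocess

LLAMA_BENCH = "./build/bin/llama-bench"        # assumed build location
MODEL = "qwen3.6-35b-a3b-q4_k_m.gguf"          # placeholder filename
KV_TYPES = ["f16", "q8_0", "turbo4", "turbo3"]
CONTEXTS = [0, 32_768, 131_072, 262_144, 524_288, 1_048_576]

for kv in KV_TYPES:
    for n_ctx in CONTEXTS:
        cmd = [LLAMA_BENCH, "-m", MODEL, "-ctk", kv, "-ctv", kv, "-o", "csv"]
        if n_ctx == 0:
            cmd += ["-n", "128"]               # pure decode, empty cache
        else:
            cmd += ["-p", str(n_ctx),          # prefill at this length
                    "-pg", f"{n_ctx},128"]     # 128-token decode after it
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Wider cache types are expected to fail (OOM) near 1M tokens.
            print(f"{kv} @ {n_ctx}: FAILED")
        else:
            print(result.stdout.strip())
```

The -p and -pg tests separate the two phases the analysis below cares about: prefill throughput at a given context, and decode throughput after that context has been filled.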

// ANALYSIS

The interesting part is not that KV quantization helps, but that it helps in different ways depending on whether you are paying for prefill or for decode.

  • At short context, f16 still leads or stays essentially tied, so there is no free lunch for small prompts.
  • Around 128K, turbo3 catches q8_0 on prefill, which is a strong sign the workload is becoming bandwidth-bound.
  • turbo3 and turbo4 split cleanly by phase: turbo3 is better on prefill at 256K, while turbo4 wins decode and widens that lead at 512K.
  • Only turbo3 survives the full 1M-token run, so if the goal is absolute context ceiling on this machine, it is the practical choice (see the memory sketch after this list).
  • The results are hardware-specific and cover symmetric K/V types only, so the crossover points are a map for Apple Silicon rather than a universal rule.
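
The memory side of that ceiling claim is easy to sanity-check. A back-of-envelope sketch follows, with the caveat that the Qwen 3.6-35B-A3B geometry is not given in the post (layer count, KV heads, and head dimension below are placeholder values) and the turbo3/turbo4 storage costs are guesses of roughly 3.5 and 4.5 bits per element including block scales.

```python
"""Back-of-envelope KV cache sizes behind the 1M-token ceiling.
Model geometry and per-type bit widths are assumptions, not
values from the post."""

GiB = 1024 ** 3

# Assumed model geometry (placeholders).
N_LAYERS = 48
N_KV_HEADS = 4          # grouped-query attention
HEAD_DIM = 128

# Effective bits per stored element, including quantization scales.
# turbo3/turbo4 costs are guesses; f16 and q8_0 follow llama.cpp norms.
BITS = {"f16": 16.0, "q8_0": 8.5, "turbo4": 4.5, "turbo3": 3.5}

def kv_cache_gib(n_tokens: int, bits: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer per token.
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * n_tokens
    return elems * bits / 8 / GiB

for n_tokens in (131_072, 262_144, 524_288, 1_048_576):
    row = ", ".join(f"{t}={kv_cache_gib(n_tokens, b):6.1f} GiB"
                    for t, b in BITS.items())
    print(f"{n_tokens:>9} tok: {row}")
```

Under these assumptions, the f16 cache alone reaches about 96 GiB at 1M tokens while a ~3.5-bit type stays near 21 GiB, which is consistent with only the narrowest type surviving the full run on a fixed-memory machine.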
// TAGS
benchmark · inference · llm · open-source · gpu · llama-cpp-turboquant

DISCOVERED: 3h ago (2026-04-28)

PUBLISHED: 5h ago (2026-04-28)

RELEVANCE: 8/10

AUTHOR: Defilan