llama.cpp speculative decoding swings with acceptance rates
OPEN_SOURCE
REDDIT · 5h ago · TUTORIAL


A Reddit user asks why the same speculative decoding setup in `llama.cpp` produces very different speedups across models: modest gains on Gemma 4 31B, smaller gains on Qwen 3.6, and a huge boost on Devstral Small. A follow-up edit reports that switching to `--spec-type ngram-mod` and setting `--repeat-penalty 1.0` lifted Qwen 3.6 to about 140 tok/s from a 100 tok/s baseline on minor code edits.

// ANALYSIS

Hot take: this is mostly an acceptance-rate story, not a “bigger model = faster” story. Speculative decoding only pays off when the draft tokens match what the target model would have produced anyway.
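A back-of-the-envelope model makes the acceptance-rate point concrete. Assume each draft token is accepted independently with probability p and the drafter proposes k tokens per verification pass (a simplification; this is not llama.cpp's actual scheduler, and real acceptance is correlated across a run):

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    # Expected tokens committed per target-model verification pass:
    # the verifier accepts a prefix of the k drafts, then always
    # commits one token of its own. Geometric sum 1 + p + ... + p^k.
    return sum(p**i for i in range(k + 1))

def speculative_speedup(p: float, k: int, draft_cost: float) -> float:
    # Rough speedup vs. plain decoding: tokens gained per pass divided
    # by the relative cost of one pass (1 target forward + k drafts,
    # each costing draft_cost relative to a target forward).
    return expected_tokens_per_pass(p, k) / (1 + k * draft_cost)
```

With ngram drafting the draft cost is near zero, so the speedup is almost entirely the expected-tokens term: p = 0.8 with k = 4 yields about 3.4x, while p = 0.2 yields barely 1.25x. Same flags, wildly different throughput, exactly the swings in the post.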

  • `ngram-map-k` depends on repeated local token windows, so it works best on workloads with stable patterns, like code rewrites and small edits.
  • The biggest gains happen when the prompt and continuation are repetitive enough that the verifier accepts long draft runs.
  • `--repeat-penalty 1.0` likely helped by reducing sampling distortion, which makes self-speculation more predictable on repetitive code text.
  • `ngram-mod` is often a better fit for edit-like workloads because it can reuse a shared hash pool and generate variable-length drafts.
  • Different models tokenize and continue code differently, so the same speculative setup can have very different acceptance rates and throughput.
  • A “665% speed increase” is usually a workload-specific outlier, not a universal model-quality ranking.
// TAGS
llama.cpp · speculative decoding · ngram · local llm · inference · code editing · performance

DISCOVERED

5h ago

2026-04-20

PUBLISHED

7h ago

2026-04-19

RELEVANCE

6/10

AUTHOR

GodComplecs