OPEN_SOURCE ↗
REDDIT // 5h ago · TUTORIAL
llama.cpp speculative decoding swings with acceptance rates
A Reddit user asks why the same speculative decoding setup in `llama.cpp` produces very different speedups across models: modest gains on Gemma 4 31B, smaller gains on Qwen 3.6, and a huge boost on Devstral Small. A follow-up edit reports that switching to `--spec-type ngram-mod` and setting `--repeat-penalty 1.0` pushed Qwen 3.6 to about 140 tok/s over a 100 tok/s baseline on minor code edits.
// ANALYSIS
Hot take: this is mostly an acceptance-rate story, not a "bigger model = faster" story. Speculative decoding only pays off when the draft tokens match what the target model would have produced anyway; when acceptance is low, the verifier throws the drafts away and you pay the drafting overhead for nothing.
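To see why acceptance rate dominates, here is a minimal sketch of the standard simplified model: assume each draft token is accepted independently with probability `alpha`, and the verifier emits one token per accepted draft token plus one correction token. The function name and the constant-`alpha` assumption are illustrative, not from the post.

```python
# Expected tokens emitted per verification step under a constant
# per-token acceptance probability `alpha` and draft length `k`.
# Simplified model: real acceptance varies token to token.
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # alpha^0 + alpha^1 + ... + alpha^k: each term is the probability
    # that the draft survives at least that many tokens.
    return sum(alpha ** i for i in range(k + 1))

# Repetitive code-edit workload (high alpha) vs. free-form prose (low alpha):
print(expected_tokens_per_step(0.9, 8))  # long accepted runs, big speedup
print(expected_tokens_per_step(0.3, 8))  # drafts mostly wasted
```

With `alpha = 0.9` you get roughly 6 tokens per step; with `alpha = 0.3`, barely more than 1, i.e. no better than plain decoding once drafting cost is counted. That gap alone can explain a 6x swing between workloads.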
- `ngram-map-k` depends on repeated local token windows, so it works best on workloads with stable patterns, like code rewrites and small edits.
- The biggest gains happen when the prompt and continuation are repetitive enough that the verifier accepts long draft runs.
- `--repeat-penalty 1.0` likely helped by reducing sampling distortion, which makes self-speculation more predictable on repetitive code text.
- `ngram-mod` is often a better fit for edit-like workloads because it can reuse a shared hash pool and generate variable-length drafts.
- Different models tokenize and continue code differently, so the same speculative setup can have very different acceptance rates and throughput.
- A "665% speed increase" is usually a workload-specific outlier, not a universal model-quality ranking.
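The n-gram drafting the bullets describe can be sketched in a few lines: look up the most recent earlier occurrence of the last few tokens, and propose whatever followed it as the draft. This is a toy illustration of the general technique, not the actual `llama.cpp` implementation; the function and parameter names are hypothetical.

```python
# Toy n-gram self-speculation: find the last N tokens earlier in the
# history and propose the tokens that followed that match as a draft.
def ngram_draft(tokens, n=3, max_draft=8):
    key = tuple(tokens[-n:])
    # Scan backwards for the most recent earlier occurrence of the window
    # (excluding the window at the very end itself).
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            return tokens[i + n : i + n + max_draft]  # propose what followed
    return []  # no match: nothing to speculate

# Edit-like history (characters standing in for tokens) yields a draft:
history = list("def add(a, b): return a + b\ndef add(a, ")
print(ngram_draft(history, n=4))
```

On this repetitive input the drafter proposes the continuation of the first `add` definition, which is exactly why code edits see long accepted runs while novel prose gives the drafter nothing to match.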
// TAGS
llama.cpp · speculative decoding · ngram · local llm · inference · code editing · performance
DISCOVERED
5h ago
2026-04-20
PUBLISHED
7h ago
2026-04-19
RELEVANCE
6/10
AUTHOR
GodComplecs