llama.cpp speculative decoding swings with acceptance rates

// 45d ago · TUTORIAL

A Reddit user asks why the same speculative decoding setup in `llama.cpp` produces very different speedups across models: modest gains on Gemma 4 31B, smaller gains on Qwen 3.6, and a huge boost on Devstral Small. A follow-up edit reports that switching to `--spec-type ngram-mod` and setting `--repeat-penalty 1.0` raised Qwen 3.6 to about 140 tokens/s, up from a 100 tokens/s baseline, on minor code edits.

// ANALYSIS

Hot take: this is mostly an acceptance-rate story, not a “bigger model = faster” story. Speculative decoding only pays off when the draft tokens match what the target model would have produced anyway.
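
To make the acceptance-rate point concrete, here is a minimal sketch of the standard expected-yield estimate for speculative decoding. It assumes every draft token is accepted independently with the same probability, which real workloads only approximate, and it ignores the cost of producing the draft. The function name and parameters are illustrative and are not part of `llama.cpp`.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the target model.

    Assumption: each draft token is accepted independently with probability
    `accept_rate`. The pass always yields at least one token, because the
    target model supplies its own token at the first mismatch (or after a
    fully accepted draft).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for rate in (0.2, 0.5, 0.8, 0.95):
        print(rate, round(expected_tokens_per_step(rate, draft_len=8), 2))
```

If a verification pass costs roughly the same as one ordinary decode step (a common assumption for memory-bound local inference, not something the post measures), this value is an upper bound on the speedup, which is why small shifts in acceptance rate can move throughput a lot.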

  • `ngram-map-k` depends on repeated local token windows, so it works best on workloads with stable patterns, like code rewrites and small edits.
  • The biggest gains happen when the prompt and continuation are repetitive enough that the verifier accepts long draft runs.
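  • `--repeat-penalty 1.0` likely helped because a value of 1.0 disables the repetition penalty, so the target model is no longer pushed away from the repeated tokens that n-gram drafts propose; that makes self-speculation more predictable on repetitive code text.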
  • `ngram-mod` is often a better fit for edit-like workloads because it can reuse a shared hash pool and generate variable-length drafts.
  • Different models tokenize and continue code differently, so the same speculative setup can have very different acceptance rates and throughput.
  • A “665% speed increase” is usually a workload-specific outlier, not a universal model-quality ranking.
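
The points above about repeated token windows and accepted draft runs can be sketched generically. This is a simplified lookup-style drafter written for illustration, not the actual `ngram-map-k` or `ngram-mod` implementation in `llama.cpp`; the function names, the backward scan, and the greedy exact-match check are all assumptions.

```python
from typing import List, Sequence


def propose_ngram_draft(tokens: Sequence[int],
                        ngram_size: int = 3,
                        max_draft: int = 8) -> List[int]:
    """Draft tokens by reusing what followed the most recent earlier
    occurrence of the trailing n-gram (hypothetical helper)."""
    if len(tokens) <= ngram_size:
        return []
    key = tuple(tokens[-ngram_size:])
    # Scan backwards over earlier windows, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tuple(tokens[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []  # No repeated window found, so nothing to speculate on.


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Length of the longest draft prefix that matches the tokens the
    target model would have produced (greedy, exact-match check)."""
    accepted = 0
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 1, 2, 3]           # repetitive "edit-like" history
    draft = propose_ngram_draft(context, ngram_size=3, max_draft=4)
    target_continuation = [4, 5, 9, 9]               # what the target would emit
    print(draft, count_accepted(draft, target_continuation))
```

On repetitive text the lookup finds long matching continuations and most of the draft is accepted; on novel text it either finds no match or the target diverges after a token or two, so the same settings yield very different throughput.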
// TAGS
llamacpp, speculative decoding, ngram, local llm, inference, code editing, performance

DISCOVERED

45d ago

2026-04-20

PUBLISHED

45d ago

2026-04-19

RELEVANCE

6/10

AUTHOR

GodComplecs