OPEN_SOURCE ↗
REDDIT // 5h ago · TUTORIAL
llama.cpp speculative decoding swings with acceptance rates
A Reddit user asks why the same speculative decoding setup in `llama.cpp` produces very different speedups across models: modest gains on Gemma 4 31B, smaller gains on Qwen 3.6, and a huge boost on Devstral Small. A follow-up edit reports that switching to `--spec-type ngram-mod` and setting `--repeat-penalty 1.0` pushed Qwen 3.6 to about 140 tok/s over a 100 tok/s baseline on minor code edits.
// ANALYSIS
Hot take: this is mostly an acceptance-rate story, not a "bigger model = faster" story. Speculative decoding only pays off when the draft tokens match what the target model would have produced anyway; when acceptance is low, the verifier throws the drafts away and you pay the drafting overhead for nothing.
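To see why acceptance rate dominates, here is a minimal sketch of the standard simplified model: assume each draft token is accepted independently with probability `alpha`, and the verifier emits one token per accepted draft token plus one correction token. The function name and the constant-`alpha` assumption are illustrative, not from the post.

```python
# Expected tokens emitted per verification step under a constant
# per-token acceptance probability `alpha` and draft length `k`.
# Simplified model: real acceptance varies token to token.
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # alpha^0 + alpha^1 + ... + alpha^k: each term is the probability
    # that the draft survives at least that many tokens.
    return sum(alpha ** i for i in range(k + 1))

# Repetitive code-edit workload (high alpha) vs. free-form prose (low alpha):
print(expected_tokens_per_step(0.9, 8))  # long accepted runs, big speedup
print(expected_tokens_per_step(0.3, 8))  # drafts mostly wasted
```

With `alpha = 0.9` you get roughly 6 tokens per step; with `alpha = 0.3`, barely more than 1, i.e. no better than plain decoding once drafting cost is counted. That gap alone can explain a 6x swing between workloads.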
- `ngram-map-k` depends on repeated local token windows, so it works best on workloads with stable patterns, like code rewrites and small edits.
- The biggest gains happen when the prompt and continuation are repetitive enough that the verifier accepts long draft runs.
- `--repeat-penalty 1.0` likely helped by reducing sampling distortion, which makes self-speculation more predictable on repetitive code text.
- `ngram-mod` is often a better fit for edit-like workloads because it can reuse a shared hash pool and generate variable-length drafts.
- Different models tokenize and continue code differently, so the same speculative setup can have very different acceptance rates and throughput.
- A "665% speed increase" is usually a workload-specific outlier, not a universal model-quality ranking.
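The n-gram drafting the bullets describe can be sketched in a few lines: look up the most recent earlier occurrence of the last few tokens, and propose whatever followed it as the draft. This is a toy illustration of the general technique, not the actual `llama.cpp` implementation; the function and parameter names are hypothetical.

```python
# Toy n-gram self-speculation: find the last N tokens earlier in the
# history and propose the tokens that followed that match as a draft.
def ngram_draft(tokens, n=3, max_draft=8):
    key = tuple(tokens[-n:])
    # Scan backwards for the most recent earlier occurrence of the window
    # (excluding the window at the very end itself).
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            return tokens[i + n : i + n + max_draft]  # propose what followed
    return []  # no match: nothing to speculate

# Edit-like history (characters standing in for tokens) yields a draft:
history = list("def add(a, b): return a + b\ndef add(a, ")
print(ngram_draft(history, n=4))
```

On this repetitive input the drafter proposes the continuation of the first `add` definition, which is exactly why code edits see long accepted runs while novel prose gives the drafter nothing to match.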
// TAGS
llama.cpp · speculative decoding · ngram · local llm · inference · code editing · performance
DISCOVERED
5h ago
2026-04-20
PUBLISHED
7h ago
2026-04-19
RELEVANCE
6/10
AUTHOR
GodComplecs