llama.cpp MTP boosts Strix Halo throughput
OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT

A Reddit benchmark shows llama.cpp’s incoming MTP (multi-token prediction) support pushing an AMD Strix Halo AI Max 395 from roughly 40 tok/s to 60-80 tok/s on a Qwen3.6-35B MTP GGUF. Prompt processing appears unchanged, so the gain lands almost entirely on generation speed.

// ANALYSIS

This is the kind of win that matters for local LLMs: same model, same hardware, materially better decode speed. On memory-bandwidth-limited APUs like Strix Halo, MTP looks like a real practical lever rather than a lab-only trick.

  • The test uses PR #22673 plus `--spec-type mtp --spec-draft-n-max 3`, so the speedup is tied to an in-flight llama.cpp feature, not a separate runtime
  • The model is large enough to stress local memory bandwidth, which makes the result more interesting than a small-model microbenchmark
  • The reported jump from ~40 tok/s to 60-80 tok/s is meaningful for interactive use, especially for long generations
  • Prompt processing staying flat suggests MTP is helping decode, not the full inference pipeline
  • The post also hints at hardware/backend sensitivity: Vulkan on this setup appears competitive, while ROCm is still being tuned
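For readers who want to try reproducing the result, a minimal sketch of the setup is below. The PR number, the `--spec-type mtp --spec-draft-n-max 3` flags, and the Vulkan backend choice all come from the post; the flag names belong to an in-flight PR and may change before merge, and `qwen-mtp.gguf` is a hypothetical placeholder for the MTP-enabled GGUF file.

```shell
# Sketch, assuming the flags from the Reddit post; the PR is unmerged,
# so names and behavior may drift.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/22673/head:mtp && git checkout mtp

# Vulkan build -- the backend the post found competitive on Strix Halo,
# while ROCm was reportedly still being tuned.
cmake -B build -DGGML_VULKAN=ON
cmake --build build -j

# Single model, no separate draft model: MTP drafts up to 3 tokens per
# decode step from the main model's own prediction heads.
./build/bin/llama-cli -m qwen-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 3 \
  -p "Explain speculative decoding in one paragraph."
```

Unlike classic speculative decoding, this needs no second draft GGUF, which is why it is attractive on memory-constrained APUs.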
// TAGS
llm · inference · gpu · benchmark · open-source · local-first · llama-cpp

DISCOVERED
3h ago (2026-05-06)

PUBLISHED
4h ago (2026-05-05)

RELEVANCE
8/10

AUTHOR
Edenar