OPEN_SOURCE
REDDIT // 3h ago // BENCHMARK RESULT
llama.cpp MTP boosts Strix Halo throughput
A Reddit benchmark shows llama.cpp’s incoming MTP (multi-token prediction) support pushing an AMD Strix Halo (Ryzen AI Max+ 395) from roughly 40 tok/s to 60-80 tok/s on a Qwen3.6-35B MTP GGUF. Prompt processing appears unchanged, so the gain lands almost entirely on decode speed.
// ANALYSIS
This is the kind of win that matters for local LLMs: same model, same hardware, materially better decode speed. On memory-bandwidth-limited APUs like Strix Halo, MTP looks like a real practical lever rather than a lab-only trick.
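To make the bandwidth argument concrete, here is a rough sketch of the speculative-decoding arithmetic. Only the 40 and 60-80 tok/s figures come from the post; the per-pass cost model and the acceptance rates are illustrative assumptions:

```python
# Back-of-envelope: in MTP/speculative decoding, each verification pass
# streams the full weights roughly once (about the memory-bandwidth cost
# of one ordinary decode step) but can commit one verified token plus
# every accepted draft. On a bandwidth-bound APU, decode throughput then
# scales with tokens committed per pass.

base_tok_s = 40      # reported non-MTP decode speed (from the post)
draft_n_max = 3      # --spec-draft-n-max 3: at most 3 drafted tokens per pass

# Assumed average acceptance counts (illustrative, not measured):
for avg_accepted in (0.0, 0.5, 1.0, 2.0, 3.0):
    tokens_per_pass = 1 + avg_accepted
    print(f"~{avg_accepted} drafts accepted/pass -> ~{base_tok_s * tokens_per_pass:.0f} tok/s")
```

An average of just 0.5-1 accepted drafts per pass already lands in the reported 60-80 tok/s band, so a draft depth of 3 is plausible headroom rather than an exotic setting.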
- The test uses PR #22673 plus `--spec-type mtp --spec-draft-n-max 3`, so the speedup is tied to an in-flight llama.cpp feature, not a separate runtime
- The model is large enough to stress local memory bandwidth, which makes the result more interesting than a small-model microbenchmark
- The reported jump from ~40 tok/s to 60-80 tok/s is meaningful for interactive use, especially for long generations
- Prompt processing staying flat suggests MTP is helping decode, not the full inference pipeline (a minimal sketch of the decode loop follows this list)
- The post also hints at hardware/backend sensitivity: Vulkan on this setup appears competitive, while ROCm is still being tuned
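To see why only decode improves, here is a minimal, hypothetical sketch of an MTP-style draft-and-verify loop. It is not llama.cpp's implementation (that lives in PR #22673); `full_model` and `mtp_draft` are stand-in functions, and sampling is greedy for simplicity:

```python
def argmax(dist):
    # Greedy pick: token with the highest probability in a {token: prob} dict.
    return max(dist, key=dist.get)

def decode(ctx, steps, full_model, mtp_draft, draft_n_max=3):
    """Generate `steps` tokens from token list `ctx` by draft-and-verify."""
    out = []
    while len(out) < steps:
        drafts = mtp_draft(ctx, draft_n_max)  # cheap guesses from the MTP head
        # Verify ctx plus every draft prefix. A real implementation does this
        # as ONE batched forward pass, costing about one decode step of memory
        # bandwidth; separate calls are written here only for clarity.
        verified = [argmax(full_model(ctx + drafts[:i]))
                    for i in range(len(drafts) + 1)]
        accepted = []
        for i, d in enumerate(drafts):
            if verified[i] != d:              # first mismatch ends acceptance
                break
            accepted.append(d)
        # Commit accepted drafts plus the full model's own token at the first
        # unverified position, so every pass yields at least one token.
        committed = accepted + [verified[len(accepted)]]
        ctx = ctx + committed
        out += committed
    return out
```

The prompt is still processed by the full model exactly once before any of this runs; MTP only changes the per-token loop afterward, which matches the flat prompt-processing numbers in the post.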
// TAGS
llm · inference · gpu · benchmark · open-source · local-first · llama-cpp
DISCOVERED
2026-05-06 (3h ago)
PUBLISHED
2026-05-05 (4h ago)
RELEVANCE
8 / 10
AUTHOR
Edenar