OPEN_SOURCE
REDDIT · 32d ago · PRODUCT UPDATE
llama.cpp drafts Vulkan speedup for Qwen3.5
A draft llama.cpp pull request adds Vulkan compute-shader support for GGML_OP_GATED_DELTA_NET, the core recurrence op used by Qwen3.5 and Qwen3-Next models. Early AMD benchmarks in the PR show roughly 18-23% faster token generation versus current master, with backend-op tests already passing.
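To make the op concrete: GGML_OP_GATED_DELTA_NET implements a gated delta-rule recurrence from the linear-attention family these models use. The sketch below is not the llama.cpp kernel (which operates on multi-head matrix state); it shows the simplified scalar-state form of the recurrence, with illustrative names (`alpha` for the decay gate, `beta` for the write strength).

```python
# Hedged sketch, not the actual Vulkan/GGML kernel: a gated delta-rule
# recurrence with scalar state (d_k = d_v = 1 for clarity).
#   S_t = alpha_t * S_{t-1} * (1 - beta_t * k_t^2) + beta_t * v_t * k_t
#   o_t = S_t * q_t
# alpha_t decays old state, beta_t controls how strongly the new
# key/value pair overwrites it.

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrence step over scalar state S; returns (new state, output)."""
    S = alpha * S * (1.0 - beta * k * k) + beta * v * k
    return S, S * q

def run(tokens):
    """tokens: iterable of (q, k, v, alpha, beta) per time step."""
    S, outs = 0.0, []
    for q, k, v, alpha, beta in tokens:
        S, o = gated_delta_step(S, q, k, v, alpha, beta)
        outs.append(o)
    return outs
```

Because each step depends on the previous state, decode (one token at a time) maps naturally onto a GPU kernel, while prefill needs the chunked parallel formulation the PR author flags as still outstanding.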
// ANALYSIS
This is the kind of low-level optimization that matters more than flashy model launches for local inference users: if Vulkan gets these kernels, AMD laptops and handhelds become much more viable for modern Qwen-class models.
- The PR targets a specific bottleneck in Qwen3.5 and Qwen3-Next rather than chasing generic benchmark wins, which makes the improvement especially relevant for those model families
- Reported gains are strongest on token generation, including a jump from 38.04 to 46.64 t/s on Qwen3-Coder-Next UD-Q4_K_XL and from 43.90 to 53.33 t/s on Qwen3.5-35B-A3B Q8_0
- Support covers both standard and KDA variants, plus multiple state sizes and GQA broadcast, so this looks broader than a one-off hardware hack
- The big remaining caveat is prefill throughput: the author says phase 2 still needs a chunked parallel kernel, so decode speeds may improve before long-context ingest does
- Because this lives in a draft PR, the real story is momentum: Vulkan backend work in llama.cpp is getting serious enough to materially change AMD user experience
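As a sanity check on the headline figure, the per-model throughput numbers quoted above do land inside the roughly 18-23% range:

```python
# Verify that the per-model t/s figures from the PR imply the
# ~18-23% headline speedup. Numbers are taken from the post above.
def speedup_pct(before, after):
    """Percentage throughput gain going from `before` to `after` t/s."""
    return (after / before - 1.0) * 100.0

qwen3_coder = speedup_pct(38.04, 46.64)  # Qwen3-Coder-Next UD-Q4_K_XL
qwen35_35b = speedup_pct(43.90, 53.33)   # Qwen3.5-35B-A3B Q8_0
# Both come out around 21-23%, consistent with the headline range.
```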
// TAGS
llama-cpp · llm · inference · gpu · open-source
DISCOVERED
2026-03-10
PUBLISHED
2026-03-10
RELEVANCE
8/10
AUTHOR
jacek2023