llama.cpp drafts Vulkan speedup for Qwen3.5
OPEN_SOURCE
REDDIT · 32d ago · PRODUCT UPDATE


A draft llama.cpp pull request adds Vulkan compute-shader support for GGML_OP_GATED_DELTA_NET, the core recurrence op used by Qwen3.5 and Qwen3-Next models. Early AMD benchmarks in the PR show roughly 18-23% faster token generation versus current master, with backend-op tests already passing.
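To make the op concrete, here is a rough, illustrative Python sketch of a gated delta-rule recurrence of the kind GGML_OP_GATED_DELTA_NET computes. The shapes, gating, and update form follow the published Gated DeltaNet formulation (S_t = α_t(I − β_t k k^T) S_{t−1} + β_t k v^T); they are assumptions for illustration, not the exact ggml kernel.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One sequential step of a gated delta-rule recurrence (sketch).

    S: (d_k, d_v) recurrent state; q, k: (d_k,); v: (d_v,);
    alpha, beta: scalar gates in (0, 1). Shapes and gating are
    illustrative assumptions, not the exact ggml kernel.
    """
    # Decay the state and erase the old value bound to key k,
    # then write the new key/value association (rank-1 update).
    S = alpha * (S - beta * np.outer(k, k @ S)) + beta * np.outer(k, v)
    # Read the state out with the query.
    o = S.T @ q
    return S, o
```

The per-token dependence of S_t on S_{t−1} is what makes this a sequential decode-side op, and why prefill needs the separate chunked parallel kernel the PR author mentions.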

// ANALYSIS

This is the kind of low-level optimization that matters more than flashy model launches for local inference users: if Vulkan gets these kernels, AMD laptops and handhelds become much more viable for modern Qwen-class models.

  • The PR targets a specific bottleneck in Qwen3.5 and Qwen3-Next rather than chasing generic benchmark wins, which makes the improvement especially relevant for those model families
  • Reported gains are strongest on token generation: 38.04 → 46.64 t/s on Qwen3-Coder-Next UD-Q4_K_XL and 43.90 → 53.33 t/s on Qwen3.5-35B-A3B Q8_0
  • Support covers both standard and KDA variants, plus multiple state sizes and GQA broadcast, so this looks broader than a one-off hardware hack
  • The big remaining caveat is prefill throughput: the author says phase 2 still needs a chunked parallel kernel, so decode speeds may improve before long-context ingest does
  • Because this lives in a draft PR, the real story is momentum: Vulkan backend work in llama.cpp is getting serious enough to materially change AMD user experience
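As a sanity check, the per-model figures quoted above are consistent with the headline 18-23% range; computing the gains directly:

```python
# Percentage speedup from the before/after t/s figures reported in the PR.
pairs = {
    "Qwen3-Coder-Next UD-Q4_K_XL": (38.04, 46.64),
    "Qwen3.5-35B-A3B Q8_0": (43.90, 53.33),
}
for name, (before, after) in pairs.items():
    gain = (after / before - 1) * 100
    print(f"{name}: +{gain:.1f}%")
# → Qwen3-Coder-Next UD-Q4_K_XL: +22.6%
# → Qwen3.5-35B-A3B Q8_0: +21.5%
```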
// TAGS
llama-cpp · llm · inference · gpu · open-source

DISCOVERED

2026-03-10 (32d ago)

PUBLISHED

2026-03-10 (32d ago)

RELEVANCE

8/10

AUTHOR

jacek2023