OPEN_SOURCE
REDDIT // 21d ago // INFRASTRUCTURE
ik_llama.cpp hits 26x speedup on Qwen 3.5
A specialized fork of llama.cpp introduces fused CUDA kernels for Qwen 3.5's hybrid Gated DeltaNet architecture, achieving a 26x speedup in prompt evaluation and 3.5x in generation.
// ANALYSIS
Mainline llama.cpp's struggle with hybrid SSM architectures like Qwen 3.5 highlights a growing optimization gap as linear-time models gain traction.
- Fused GDN kernels reduce graph splits from 34 to 2, offloading recurrent computation entirely from the CPU to the GPU.
- A 26x jump in prompt processing (from 43 to 1,122 tok/sec) makes the 27B model viable for agentic coding even with mandatory re-processing.
- Qwen 3.5's hybrid architecture is technically superior for long context but requires specific low-level kernel support that mainline has yet to integrate.
- Pre-built Windows binaries with CUDA 12.8 and AVX512 VNNI are available via the Thireus fork as a drop-in replacement for llama-server.
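The throughput figures above can be sanity-checked with quick arithmetic. The token rates are from the post; the 32k context size is an illustrative assumption to show why prompt-eval speed dominates agentic workloads, where each turn re-processes the full context.

```python
# Numbers reported in the post (prompt evaluation, tok/sec).
prompt_tok_s_before = 43      # mainline llama.cpp
prompt_tok_s_after = 1122     # ik_llama.cpp fused GDN kernels

speedup = prompt_tok_s_after / prompt_tok_s_before
print(f"prompt-eval speedup: {speedup:.1f}x")   # ~26.1x

# Assumed context size for a single agent turn (illustrative only).
# With mandatory re-processing, the whole context is re-evaluated each turn,
# so prompt throughput sets the wall-clock latency floor.
context_tokens = 32_000
before_s = context_tokens / prompt_tok_s_before
after_s = context_tokens / prompt_tok_s_after
print(f"32k-token re-process: {before_s:.0f}s -> {after_s:.1f}s")
```

At the old rate a 32k-token re-process takes over twelve minutes; at the new rate it is under half a minute, which is the difference between unusable and interactive for coding agents.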
// TAGS
ik-llama-cpp · qwen · inference · gpu · open-source
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
New-Inspection7034