ik_llama.cpp hits 26x speedup on Qwen 3.5
A specialized fork of llama.cpp introduces fused CUDA kernels for Qwen 3.5's hybrid Gated DeltaNet (GDN) architecture, achieving a 26x speedup in prompt evaluation and a 3.5x speedup in token generation.
Mainline llama.cpp's struggle with hybrid SSM architectures like Qwen 3.5 highlights a growing optimization gap as linear-time models gain traction.
- Fused GDN kernels reduce graph splits from 34 to 2, offloading recurrent computation entirely from the CPU to the GPU.
- A 26x jump in prompt processing (from 43 to 1,122 tok/sec) makes the 27B model viable for agentic coding even with mandatory re-processing.
- Qwen 3.5's hybrid architecture is technically superior for long context but requires specific low-level kernel support that mainline has yet to integrate.
- Pre-built Windows binaries with CUDA 12.8 and AVX512 VNNI are available via the Thireus fork as a drop-in replacement for llama-server.
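To put the prompt-processing figures above in concrete terms, here is a minimal arithmetic sketch. The two throughput numbers come from the summary; the 8,000-token prompt length and the helper names are assumptions chosen purely for illustration, not values from the original benchmark.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after_tps / before_tps


def seconds_to_process(tokens: int, tps: float) -> float:
    """Wall-clock time to evaluate `tokens` prompt tokens at `tps` tokens/sec."""
    return tokens / tps


# Throughput figures quoted in the summary (prompt evaluation, tok/sec).
BEFORE_TPS = 43.0
AFTER_TPS = 1122.0

# Hypothetical prompt size for an agentic coding turn (assumption, not from the source).
PROMPT_TOKENS = 8_000

if __name__ == "__main__":
    print(f"speedup: {speedup(BEFORE_TPS, AFTER_TPS):.1f}x")
    print(f"before:  {seconds_to_process(PROMPT_TOKENS, BEFORE_TPS):.0f} s per re-process")
    print(f"after:   {seconds_to_process(PROMPT_TOKENS, AFTER_TPS):.1f} s per re-process")
```

Under that assumed prompt size, each full re-process drops from roughly three minutes to a few seconds, which is why the re-processing cost stops being a blocker for interactive use.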
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
AUTHOR
New-Inspection7034