OPEN_SOURCE
REDDIT // BENCHMARK RESULT
ik_llama.cpp posts huge Qwen3.5 CPU gains
A Reddit benchmark on an AMD Ryzen AI 9 365 found ik_llama.cpp dramatically ahead of mainline llama.cpp for Unsloth's Qwen3.5 4B IQ4_XS on CPU, with roughly 5x faster prompt processing and 1.7x faster token generation. The result lines up with the fork's own positioning as a performance-focused llama.cpp variant tuned for CPU, hybrid inference, and newer quantization schemes.
// ANALYSIS
This looks less like a lucky benchmark and more like proof that the local inference ecosystem is splintering into specialized forks for specific model families and hardware targets.
- The posted numbers are hard to ignore: about 281.6 t/s vs 56.5 t/s on prompt processing and 22.4 t/s vs 12.9 t/s on token generation for the same Qwen3.5 4B quant
- ik_llama.cpp's README explicitly emphasizes better CPU performance, custom quants, and model-specific optimizations, so the gain is consistent with the project's design goals rather than a random anomaly
- Comments from contributors and power users point to chunked delta-net work and repeated CPU-side optimization passes as likely reasons Qwen3 and Qwen3.5 perform especially well here
- This is still an anecdotal community benchmark, not a controlled bake-off, and at least one commenter reported weaker gains or regressions in hybrid CPU+GPU setups
- If these results hold broadly, mainline llama.cpp risks becoming the compatibility baseline while forks like ik_llama.cpp become the speed path for serious local CPU inference
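The headline multipliers follow directly from the posted throughput figures. A minimal sanity check (numbers taken from the Reddit post; the dictionary names are illustrative, not from any benchmark tool):

```python
# Reported tokens-per-second from the Reddit benchmark (Qwen3.5 4B IQ4_XS, CPU).
# "pp" = prompt processing, "tg" = token generation.
mainline = {"pp": 56.5, "tg": 12.9}   # llama.cpp
fork     = {"pp": 281.6, "tg": 22.4}  # ik_llama.cpp

for key, label in [("pp", "prompt processing"), ("tg", "token generation")]:
    speedup = fork[key] / mainline[key]
    print(f"{label}: {speedup:.2f}x faster")
# prompt processing: 4.98x faster
# token generation: 1.74x faster
```

The ratios (~4.98x and ~1.74x) match the "roughly 5x" and "1.7x" claims in the summary.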
// TAGS
ik-llama-cpp · llm · inference · open-source · benchmark
DISCOVERED
2026-03-06
PUBLISHED
2026-03-05
RELEVANCE
8/10
AUTHOR
EffectiveCeilingFan