Threadripper Pro 5975WX upgrade roughly doubles LLM CPU inference speed
A hardware-constrained user seeks CPU-only inference optimizations for high-bit LLMs on Threadripper Pro systems. The investigation reveals a critical "half-bandwidth" bottleneck in Zen 2/3 chiplet designs and identifies specialized forks like ik_llama.cpp for performance gains.
The performance wall for CPU-only inference is often the Infinity Fabric link rather than the raw RAM clock. The 3945WX is limited by its 2-CCD design: each CCD has a fixed-width GMI link to the I/O die, so two CCDs can pull only about half of the 8-channel memory bandwidth, capping it at quad-channel levels regardless of RAM configuration. Upgrading to a 5975WX (4 CCDs) or 5995WX (8 CCDs) is the only way to saturate the memory controller and approach the theoretical ~200 GB/s required for large models.

Specialized forks such as ikawrakow's ik_llama.cpp carry unmerged SOTA kernels for FlashMLA and fused FFN operations, which are critical for newer DeepSeek and Qwen variants. Justine Tunney's llamafile kernels offer up to a 500% speedup in prompt evaluation by bypassing generic BLAS overhead with hand-tuned SIMD. Additionally, TurboQuant and KV-cache compression (PR #21089) remain the gold standard for maintaining speed during long-context planning tasks on high-bit quants.
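The bandwidth arithmetic behind the "half-bandwidth" claim can be sketched numerically. This is a back-of-the-envelope model, not a benchmark: the per-CCD GMI read rate (~51.2 GB/s, assuming ~32 B/cycle at a 1.6 GHz FCLK) and the one-full-weight-read-per-token rule of thumb are assumptions, not figures from the post.

```python
# Back-of-the-envelope CPU inference throughput model for Threadripper Pro.

def peak_dram_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth: channels * transfer rate * 8-byte bus width."""
    return channels * mt_per_s * bus_bytes / 1000  # GB/s

def ccd_limited_gbs(ccds: int, gmi_read_gbs: float = 51.2) -> float:
    """Assumed aggregate read limit of the CCD-to-IOD GMI links."""
    return ccds * gmi_read_gbs

def effective_gbs(channels: int, mt_per_s: int, ccds: int) -> float:
    """Whichever is lower wins: DRAM channels or the chiplet fabric links."""
    return min(peak_dram_gbs(channels, mt_per_s), ccd_limited_gbs(ccds))

def est_tokens_per_s(bandwidth_gbs: float, model_gb: float) -> float:
    """Rough rule of thumb: one full pass over the weights per generated token."""
    return bandwidth_gbs / model_gb

# 8-channel DDR4-3200 on a 2-CCD 3945WX vs a 4-CCD 5975WX,
# generating from a hypothetical 40 GB quantized model:
for ccds in (2, 4):
    bw = effective_gbs(8, 3200, ccds)
    print(f"{ccds} CCDs: {bw:.1f} GB/s -> "
          f"~{est_tokens_per_s(bw, 40.0):.1f} tok/s on a 40 GB model")
```

Under these assumptions the 2-CCD part bottoms out at ~102 GB/s (quad-channel territory) while 4 CCDs can use the full ~205 GB/s, which is where the roughly 2x generation-speed gain from the upgrade comes from.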
DISCOVERED: 2026-04-25
PUBLISHED: 2026-04-25
AUTHOR: HumanDrone8721