Threadripper Pro 5975 upgrade doubles LLM CPU inference speed
OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE


A hardware-constrained user seeks CPU-only inference optimizations for high-bit LLMs on Threadripper Pro systems. The investigation reveals a critical "half-bandwidth" bottleneck in Zen 2/3 chiplet designs and identifies specialized forks like ik_llama.cpp for performance gains.

// ANALYSIS

For CPU-only inference, the performance wall is often the Infinity Fabric link between the CCDs and the I/O die rather than the raw RAM clock. The 3945WX is limited by its 2-CCD design, whose fabric links effectively halve the 8-channel memory bandwidth to quad-channel levels regardless of RAM configuration. Upgrading to a 5975WX (4 CCDs) or 5995WX (8 CCDs) is the only way to saturate the memory controller and approach the theoretical ~200 GB/s required for large models. Specialized forks such as ikawrakow's ik_llama.cpp ship unmerged state-of-the-art kernels for FlashMLA and fused-FFN operations, which matter most for newer DeepSeek and Qwen variants. Justine Tunney's llamafile kernels offer up to a 500% speedup for prompt evaluation by bypassing standard BLAS overhead with hand-tuned SIMD. Additionally, TurboQuant and KV-cache compression (PR #21089) remain the gold standard for maintaining speed during long-context planning tasks on high-bit quants.
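The "half-bandwidth" effect above can be sketched with back-of-the-envelope arithmetic: decode on CPU is memory-bound, so tokens/s is roughly effective bandwidth divided by model size, and effective bandwidth is capped by the per-CCD Infinity Fabric read limit. The ~50 GB/s per-CCD figure and the 40 GB model size below are illustrative assumptions, not measured values.

```python
# Rough estimate of the CCD-count bandwidth bottleneck on Threadripper Pro.
# All per-CCD and model-size figures are illustrative assumptions.

def dram_bandwidth_gbs(channels: int, mt_s: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth: channels * transfer rate * bus width (GB/s)."""
    return channels * mt_s * bus_bytes / 1000

def effective_bandwidth_gbs(dram_gbs: float, ccds: int,
                            per_ccd_gbs: float = 50.0) -> float:
    """Reads are capped by the Infinity Fabric links, assumed ~50 GB/s per CCD."""
    return min(dram_gbs, ccds * per_ccd_gbs)

def tokens_per_sec(bandwidth_gbs: float, model_gb: float) -> float:
    """Memory-bound decode: each generated token streams the whole model once."""
    return bandwidth_gbs / model_gb

dram = dram_bandwidth_gbs(8, 3200)  # 8-channel DDR4-3200 ≈ 204.8 GB/s
for name, ccds in [("3945WX (2 CCDs)", 2),
                   ("5975WX (4 CCDs)", 4),
                   ("5995WX (8 CCDs)", 8)]:
    bw = effective_bandwidth_gbs(dram, ccds)
    print(f"{name}: ~{bw:.0f} GB/s -> ~{tokens_per_sec(bw, 40):.1f} tok/s "
          f"on a 40 GB model")
```

Under these assumptions a 2-CCD part tops out near half the platform's DRAM bandwidth, while 4 CCDs is enough to nearly saturate it, which is why the jump from 3945WX to 5975WX roughly doubles decode speed but 5995WX adds little more.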

// TAGS
llama.cpp · cpu · inference · infrastructure · open-source · threadripper · benchmark

DISCOVERED

2026-04-25 · 4h ago

PUBLISHED

2026-04-25 · 5h ago

RELEVANCE

8 / 10

AUTHOR

HumanDrone8721