OPEN_SOURCE
REDDIT // 5h ago · INFRASTRUCTURE
llama.cpp CPU thread tradeoffs resurface
A LocalLLaMA user asks whether an old many-core Xeon or a faster lower-core CPU is better for running large models slowly through llama.cpp on CPU, likely with DDR3 RAM. The practical answer is that llama.cpp can use multiple cores, but generation often hits memory bandwidth, NUMA, cache, and thread-scheduling limits before raw core count wins.
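A back-of-envelope bound makes the tradeoff concrete: each generated token streams roughly the entire quantized weight file from RAM, so peak memory bandwidth divided by model size caps tokens per second regardless of core count. A minimal sketch, with assumed (not measured) bandwidth and model-size figures:

```python
def max_tokens_per_sec(model_size_gb: float, bandwidth_gbps: float) -> float:
    """Rough upper bound on generation speed when inference is bandwidth-bound:
    every token requires reading ~all model weights from RAM once."""
    return bandwidth_gbps / model_size_gb

# Assumed ballpark figures for illustration, not benchmarks:
DDR3_QUAD_GBPS = 40.0   # ~quad-channel DDR3-1600 theoretical peak
DDR5_DUAL_GBPS = 80.0   # ~dual-channel DDR5-5600 theoretical peak
MODEL_70B_Q4_GB = 40.0  # ~70B parameters at 4-bit quantization

print(max_tokens_per_sec(MODEL_70B_Q4_GB, DDR3_QUAD_GBPS))  # ceiling ~1 tok/s
print(max_tokens_per_sec(MODEL_70B_Q4_GB, DDR5_DUAL_GBPS))  # ceiling ~2 tok/s
```

Real throughput lands below these ceilings (NUMA crossings, cache effects, compute overhead), but the ratio explains why adding cores to a DDR3 box stops helping early.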
// ANALYSIS
More cores help only until the memory subsystem stops feeding them; for CPU-only LLM inference, old cheap server silicon can look attractive on capacity but disappoint on tokens per second.
- llama.cpp exposes thread controls and recommends tuning around physical cores, not blindly maxing logical threads.
- DDR3 is the real warning sign: large quantized models stream weights constantly, so memory bandwidth can dominate generation speed.
- Dual-socket Xeons add RAM capacity and memory channels, but NUMA penalties can make "all cores" slower than a carefully pinned subset.
- A faster modern CPU with AVX2/AVX-512, stronger single-core performance, and DDR4/DDR5 may beat an older high-core-count box for interactive use.
- If the goal is hosting huge models cheaply and patiently, prioritize RAM capacity, memory channels, and measured llama.cpp benchmarks over core count alone.
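Before buying either box, the bandwidth side of the argument is cheap to check. A minimal stdlib-only probe (a hypothetical helper, not a llama.cpp tool) times a large in-memory copy, which runs as a `memcpy` under the hood and so approximates achievable RAM streaming bandwidth:

```python
import time

def copy_bandwidth_gbps(size_mb: int = 256, repeats: int = 5) -> float:
    """Crude RAM bandwidth probe: time a large bytes copy.
    Counts read + write traffic (2x buffer size per copy); reports best of N
    runs in GB/s. Expect results well below the platform's theoretical peak."""
    buf = bytes(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = bytes(buf)  # full copy: one sequential read stream, one write stream
        best = min(best, time.perf_counter() - t0)
    return (2 * len(buf)) / best / 1e9

if __name__ == "__main__":
    print(f"~{copy_bandwidth_gbps():.1f} GB/s copy bandwidth")
```

Comparing this number across the candidate machines (ideally alongside llama.cpp's own benchmarking output at several thread counts) is more predictive of tokens per second than comparing core counts.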
// TAGS
llama-cpp · llm · inference · self-hosted · gpu
DISCOVERED
5h ago
2026-04-22
PUBLISHED
5h ago
2026-04-21
RELEVANCE
6/10
AUTHOR
VolkoTheWorst