OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE
Legacy servers face AVX2 bottleneck in LLM inference
A Reddit discussion investigates the performance of running local LLMs on a Dell PowerEdge R720 with dual Xeon E5-2650 v2 processors and 128GB of RAM but no GPU. While the high RAM capacity allows for loading large-parameter models, the Ivy Bridge-EP architecture's lack of AVX2 instructions and slow DDR3 memory bandwidth create significant performance hurdles, leading to suboptimal token-per-second rates compared to modern consumer hardware.
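The AVX2 gap is easy to confirm on the machine itself. A minimal sketch (Linux-only; the `/proc/cpuinfo` path and the helper name `has_avx2` are illustrative, not from the discussion) that checks the kernel's reported CPU flags:

```python
# Sketch: detect AVX2 on Linux by scanning /proc/cpuinfo.
# Ivy Bridge-EP parts such as the Xeon E5-2650 v2 report "avx" but not
# "avx2" in their flags line, forcing llama.cpp onto slower CPU kernels.
def has_avx2(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        # The "flags" lines list one token per supported feature.
        return any("avx2" in line.split() for line in f if line.startswith("flags"))
```

Token-wise matching (`line.split()`) avoids false positives from flags that merely contain the substring.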
// ANALYSIS
Older rack servers are enticing for their high memory capacity, but they are often a performance trap for CPU-only inference due to missing modern instruction sets.
- The absence of AVX2 support in the Xeon E5-2600 v2 series results in a roughly 2x to 4x performance penalty for popular LLM engines like llama.cpp.
- Memory bandwidth is the primary constraint for inference; legacy DDR3 speeds severely limit the throughput of 70B+ parameter models, often yielding unusable sub-1.0 t/s speeds.
- Dual-socket NUMA (Non-Uniform Memory Access) configurations add latency and complexity; failing to tune software for NUMA can slash already poor performance by another 30-50%.
- To make an R720 viable for LLMs, owners should prioritize adding a legacy data-center GPU like the NVIDIA Tesla P40 (24GB VRAM) rather than relying on the CPU.
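The memory-bandwidth ceiling above can be sanity-checked with back-of-the-envelope arithmetic: each generated token must stream essentially the full set of model weights from RAM, so sustained bandwidth divided by weight size bounds tokens per second. The figures below (quad-channel DDR3-1600 peak for one socket, ~40 GB for a 4-bit-quantized 70B model) are illustrative assumptions, not benchmarks from the thread:

```python
# Upper bound on CPU-inference speed: t/s <= RAM bandwidth / weight size,
# since generating each token streams the full weight set from memory.
def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

# Assumed figures: one Ivy Bridge-EP socket with quad-channel DDR3-1600
# peaks around 51.2 GB/s; a 70B model quantized to ~4 bits is roughly 40 GB.
ceiling = max_tokens_per_sec(51.2, 40.0)
print(f"{ceiling:.1f} t/s ceiling")  # ~1.3 t/s even at theoretical peak bandwidth
```

Real sustained bandwidth sits well below the theoretical peak, which is consistent with the sub-1.0 t/s reports for 70B+ models on this hardware.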
// TAGS
dell-poweredge-r720 · llm · self-hosted · inference · cpu · server
DISCOVERED
4h ago
2026-04-12
PUBLISHED
6h ago
2026-04-12
RELEVANCE
7/10
AUTHOR
Typhoon-UK