OPEN_SOURCE
REDDIT // 32d ago · INFRASTRUCTURE
llama.cpp hits CPU wall with Qwen on old Xeons
This Reddit troubleshooting thread covers an attempt to run a Qwen 35B-class quantized model with llama.cpp on dual Xeon E5-2620 v4 CPUs inside a Proxmox VM, yielding unusably slow response times. The likely culprit is not a single bad setting but a setup that is fundamentally CPU- and memory-bandwidth-bound: a large model on older server silicon with no GPU offload.
// ANALYSIS
This is a classic case of local LLM expectations colliding with hardware reality: llama.cpp can run Qwen on CPUs, but a 35B-class model on older Broadwell-era Xeons is still slow enough to feel broken for interactive chat.
- llama.cpp explicitly supports Qwen models and CPU inference, but “supports” does not mean “practical” on every machine
- Large quantized models are usually bottlenecked by RAM bandwidth more than raw core count, which hurts old dual-socket Xeon boxes badly
- Running inside a VM adds another layer of overhead and can make NUMA, memory locality, and thread scheduling even worse
- For this class of hardware, smaller Qwen variants or lighter 7B-14B models are far more realistic than a 35B-class model for interactive use
- If the goal is usable latency rather than experimentation, GPU offload or a newer DDR5 desktop CPU is usually a much bigger win than adding more old server cores
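The bandwidth argument above can be made concrete with a back-of-envelope estimate: in CPU inference, each generated token requires streaming roughly the full set of quantized weights through memory once, so token rate is capped near bandwidth divided by model size. The figures below (effective bandwidth for a NUMA-penalized dual E5-2620 v4 box inside a VM, and the resident size of a Q4-quantized 35B-class model) are illustrative assumptions, not measurements from the thread:

```python
def tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper-bound decode rate: one full weight pass per generated token."""
    return bandwidth_gb_s / model_gb

# Assumed: dual E5-2620 v4, quad-channel DDR4-2133 per socket
# (~68 GB/s theoretical each); real-world, NUMA- and VM-penalized
# effective figure taken here as ~50 GB/s combined.
xeon_bw_gb_s = 50.0

# Assumed: 35B-class model at ~4.5 bits/weight -> roughly 20 GB resident.
qwen_35b_q4_gb = 20.0
# Assumed: 7B-class model at the same quantization -> roughly 4 GB.
small_7b_q4_gb = 4.0

print(f"35B ceiling: {tokens_per_second(xeon_bw_gb_s, qwen_35b_q4_gb):.1f} tok/s")
print(f" 7B ceiling: {tokens_per_second(xeon_bw_gb_s, small_7b_q4_gb):.1f} tok/s")
```

Even this optimistic ceiling (~2.5 tok/s for the 35B model versus ~12.5 tok/s for a 7B one) shows why the larger model feels broken for interactive chat on this hardware while smaller variants remain usable.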
// TAGS
llama-cpp · qwen · llm · inference · self-hosted · devtool
DISCOVERED
32d ago
2026-03-11
PUBLISHED
32d ago
2026-03-10
RELEVANCE
6 / 10
AUTHOR
JadedSoulGuy