llama.cpp hits CPU wall with Qwen on old Xeons
OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE

In this Reddit troubleshooting thread, a user tries to run a Qwen 35B-class quantized model with llama.cpp on dual Xeon E5-2620 v4 CPUs inside a Proxmox VM and gets unusably slow response times. The likely culprit is not a single bad setting but a setup that is fundamentally CPU- and memory-bandwidth-bound: a large model on older server silicon with no GPU offload.
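On this kind of hardware the usual tuning knobs are thread count and NUMA placement. A minimal sketch of a CPU-only run, assuming a hypothetical model filename (the flags themselves are real llama.cpp options):

```shell
# Model filename is a placeholder; flags are real llama.cpp CLI options.
# -t: use roughly the physical core count of ONE socket (SMT threads rarely help)
# --numa distribute: spread weight allocations across both NUMA nodes
# --mlock: keep the mapped weights resident in RAM instead of paging
llama-cli -m ./qwen-35b-q4_k_m.gguf -t 8 --numa distribute --mlock \
  -p "Hello" -n 64
```

These settings can shave overhead but will not change the fundamental bandwidth ceiling discussed below.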

// ANALYSIS

This is a classic case of local LLM expectations colliding with hardware reality: llama.cpp can run Qwen on CPUs, but a 35B-class model on older Broadwell-era Xeons is still slow enough to feel broken for interactive chat.

  • llama.cpp explicitly supports Qwen models and CPU inference, but “supports” does not mean “practical” on every machine
  • Large quantized models are usually bottlenecked by RAM bandwidth more than raw core count, which hurts old dual-socket Xeon boxes badly
  • Running inside a VM adds another layer of overhead and can make NUMA, memory locality, and thread scheduling even worse
  • For this class of hardware, smaller Qwen variants or lighter 7B-14B models are far more realistic than a 35B-class model for interactive use
  • If the goal is usable latency rather than experimentation, GPU offload or a newer DDR5 desktop CPU is usually a much bigger win than adding more old server cores
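The bandwidth bullet above can be made concrete with a back-of-envelope estimate: single-stream token generation reads essentially all of the model's weights once per token, so decode speed is roughly usable memory bandwidth divided by model size. The figures below (DDR4-2133 channel specs, ~20 GB for a Q4-quantized 35B-class model, 50% bandwidth efficiency) are rough public-spec assumptions, not measurements from the thread:

```python
# Back-of-envelope: memory-bandwidth-bound decode speed on an E5-2620 v4.
# All numbers are rough assumptions, not benchmarks.

def peak_bandwidth_gb_s(channels: int, mt_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical DRAM bandwidth: channels * transfer rate (MT/s) * 8-byte bus."""
    return channels * mt_s * bytes_per_transfer / 1000.0

def decode_tokens_per_s(model_gb: float, bandwidth_gb_s: float,
                        efficiency: float = 0.5) -> float:
    """Single-stream decode streams ~all weights per token, so
    tokens/s ~= usable bandwidth / model size in bytes."""
    return bandwidth_gb_s * efficiency / model_gb

# E5-2620 v4: 4 channels of DDR4-2133 per socket -> ~68 GB/s theoretical.
bw = peak_bandwidth_gb_s(channels=4, mt_s=2133)
model_gb = 20.0  # assumed size of a ~35B model at 4-bit quantization

print(f"peak per-socket bandwidth: {bw:.1f} GB/s")       # ~68.3 GB/s
print(f"optimistic decode rate:    {decode_tokens_per_s(model_gb, bw):.1f} tok/s")
```

Even this optimistic estimate lands under 2 tokens/s, which matches the "unusably slow for chat" experience; dual sockets rarely double it because cross-NUMA traffic eats into effective bandwidth.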
// TAGS
llama-cpp · qwen · llm · inference · self-hosted · devtool

DISCOVERED

32d ago

2026-03-11

PUBLISHED

32d ago

2026-03-10

RELEVANCE

6/10

AUTHOR

JadedSoulGuy