llama.cpp hits CPU wall with Qwen on old Xeons
OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE

In this Reddit troubleshooting thread, a user tries to run a Qwen 35B-class quantized model with llama.cpp on dual Xeon E5-2620 v4 CPUs inside a Proxmox VM and gets unusably slow response times. The likely culprit is not a single bad setting but a setup that is fundamentally CPU- and memory-bandwidth-bound: a large model on older server silicon with no GPU offload.
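On this kind of hardware the usual tuning knobs are thread count and NUMA placement. A minimal sketch of a CPU-only run, assuming a hypothetical model filename (the flags themselves are real llama.cpp options):

```shell
# Model filename is a placeholder; flags are real llama.cpp CLI options.
# -t: use roughly the physical core count of ONE socket (SMT threads rarely help)
# --numa distribute: spread weight allocations across both NUMA nodes
# --mlock: keep the mapped weights resident in RAM instead of paging
llama-cli -m ./qwen-35b-q4_k_m.gguf -t 8 --numa distribute --mlock \
  -p "Hello" -n 64
```

These settings can shave overhead but will not change the fundamental bandwidth ceiling discussed below.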

// ANALYSIS

This is a classic case of local LLM expectations colliding with hardware reality: llama.cpp can run Qwen on CPUs, but a 35B-class model on older Broadwell-era Xeons is still slow enough to feel broken for interactive chat.

  • llama.cpp explicitly supports Qwen models and CPU inference, but “supports” does not mean “practical” on every machine
  • Large quantized models are usually bottlenecked by RAM bandwidth more than raw core count, which hurts old dual-socket Xeon boxes badly
  • Running inside a VM adds another layer of overhead and can make NUMA, memory locality, and thread scheduling even worse
  • For this class of hardware, smaller Qwen variants or lighter 7B-14B models are far more realistic than a 35B-class model for interactive use
  • If the goal is usable latency rather than experimentation, GPU offload or a newer DDR5 desktop CPU is usually a much bigger win than adding more old server cores
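The bandwidth bullet above can be made concrete with a back-of-envelope estimate: single-stream token generation reads essentially all of the model's weights once per token, so decode speed is roughly usable memory bandwidth divided by model size. The figures below (DDR4-2133 channel specs, ~20 GB for a Q4-quantized 35B-class model, 50% bandwidth efficiency) are rough public-spec assumptions, not measurements from the thread:

```python
# Back-of-envelope: memory-bandwidth-bound decode speed on an E5-2620 v4.
# All numbers are rough assumptions, not benchmarks.

def peak_bandwidth_gb_s(channels: int, mt_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical DRAM bandwidth: channels * transfer rate (MT/s) * 8-byte bus."""
    return channels * mt_s * bytes_per_transfer / 1000.0

def decode_tokens_per_s(model_gb: float, bandwidth_gb_s: float,
                        efficiency: float = 0.5) -> float:
    """Single-stream decode streams ~all weights per token, so
    tokens/s ~= usable bandwidth / model size in bytes."""
    return bandwidth_gb_s * efficiency / model_gb

# E5-2620 v4: 4 channels of DDR4-2133 per socket -> ~68 GB/s theoretical.
bw = peak_bandwidth_gb_s(channels=4, mt_s=2133)
model_gb = 20.0  # assumed size of a ~35B model at 4-bit quantization

print(f"peak per-socket bandwidth: {bw:.1f} GB/s")       # ~68.3 GB/s
print(f"optimistic decode rate:    {decode_tokens_per_s(model_gb, bw):.1f} tok/s")
```

Even this optimistic estimate lands under 2 tokens/s, which matches the "unusably slow for chat" experience; dual sockets rarely double it because cross-NUMA traffic eats into effective bandwidth.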
// TAGS
llama-cpp · qwen · llm · inference · self-hosted · devtool

DISCOVERED

32d ago

2026-03-11

PUBLISHED

32d ago

2026-03-10

RELEVANCE

6/10

AUTHOR

JadedSoulGuy