OPEN_SOURCE
REDDIT // BENCHMARK RESULT
vLLM Beats llama.cpp on Quad 5060 Ti
On a quad RTX 5060 Ti rig, vLLM posts strong local-serving numbers: 1,444.9 tokens/s for prompt processing and 47.4 tokens/s for generation. The setup also shows speculative decoding lifting draft acceptance from 70.4% to 97.6%, and the author includes a practical local deployment path using `uv`, nightly wheels, and `systemd`.
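A minimal sketch of that install path, assuming vLLM's documented nightly wheel index; the environment path and Python version are placeholders, not the author's exact setup, so verify against the current vLLM docs before copying:

```sh
# Create an isolated environment with uv (path and Python version are
# illustrative choices).
uv venv ~/vllm-env --python 3.12
source ~/vllm-env/bin/activate

# Pull a nightly vLLM wheel; this extra index URL is vLLM's documented
# nightly index and may change over time.
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```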
// ANALYSIS
This is a solid real-world infra benchmark, not just a synthetic flex. The big takeaway is that vLLM’s serving stack can materially outperform llama.cpp on the same class of hardware when the workload is tuned for throughput.
- Prompt throughput comes in about 1.3x faster than the llama.cpp run cited in the post, while generation throughput is about 4.12x faster
- The draft acceptance jump from 70.4% to 97.6% suggests the speculative decoding config is doing real work, not just adding complexity; the expected-tokens arithmetic below makes the gain concrete
- The post is useful because it includes a genuinely reproducible deployment path, including `vllm serve` flags and a systemd wrapper, both sketched after this list
- The comparison is still hardware- and format-dependent: vLLM is serving FP8 while the llama.cpp side runs a GGUF Q8_K_XL model, so this is best read as a practical local-serving result rather than a universal ranking
- The note about tool-call errors at `mtp=3` is a good reminder that higher-throughput knobs can surface correctness issues before they surface benchmark wins
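Why the acceptance jump matters: under the standard speculative-decoding model, with per-token acceptance probability $a$ and draft length $k$, the expected number of tokens committed per verification step is

$$
\mathbb{E}[T] = \frac{1 - a^{k+1}}{1 - a}
$$

Treating the reported percentages as per-token rates (an assumption; MTP acceptance is really position-dependent), $k = 3$ with $a = 0.704$ yields about 2.55 tokens per step, while $a = 0.976$ yields about 3.86, so the acceptance improvement alone is worth roughly 1.5x per step before any other tuning.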
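A hedged sketch of what a throughput-tuned invocation can look like; the model ID is a placeholder and the speculative method string depends on the model family, so none of these flags should be read as the author's verbatim command:

```sh
# Serve an FP8 model tensor-parallel across the four GPUs, with MTP-style
# speculative decoding drafting 3 tokens per step ("mtp=3" in the post).
# The method name "mtp" is an assumption; check your model's vLLM recipe.
vllm serve <fp8-model-id> \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```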
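For the systemd wrapper, a generic unit like the following is the usual shape; the user, paths, and flags here are illustrative, not the author's file:

```ini
# /etc/systemd/system/vllm.service (illustrative placeholder paths)
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target

[Service]
User=vllm
ExecStart=/home/vllm/vllm-env/bin/vllm serve <fp8-model-id> --tensor-parallel-size 4
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now vllm.service` so the server survives reboots.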
// TAGS
vllm · llama-cpp · inference · gpu · benchmark · self-hosted
DISCOVERED
2026-04-29
PUBLISHED
2026-04-29
RELEVANCE
8/10
AUTHOR
see_spot_ruminate