OPEN_SOURCE
REDDIT // BENCHMARK RESULT
vLLM Beats llama.cpp on Quad 5060 Ti
On a quad RTX 5060 Ti rig, vLLM posts strong local-serving numbers: 1,444.9 tokens/s for prompt processing and 47.4 tokens/s for generation. The setup also shows speculative decoding lifting draft acceptance from 70.4% to 97.6%, and the author includes a practical local deployment path using `uv`, nightly wheels, and `systemd`.
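A minimal sketch of that install path, assuming vLLM's documented nightly wheel index; the environment path and Python version are placeholders, not the author's exact setup, so verify against the current vLLM docs before copying:

```sh
# Create an isolated environment with uv (path and Python version are
# illustrative choices).
uv venv ~/vllm-env --python 3.12
source ~/vllm-env/bin/activate

# Pull a nightly vLLM wheel; this extra index URL is vLLM's documented
# nightly index and may change over time.
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```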
// ANALYSIS
This is a solid real-world infra benchmark, not just a synthetic flex. The big takeaway is that vLLM’s serving stack can materially outperform llama.cpp on the same class of hardware when the workload is tuned for throughput.
- Prompt throughput comes in about 1.3x faster than the llama.cpp run cited in the post, while generation throughput is about 4.12x faster
- The draft acceptance jump from 70.4% to 97.6% suggests the speculative decoding config is doing real work, not just adding complexity; the expected-tokens arithmetic below makes the gain concrete
- The post is useful because it includes a genuinely reproducible deployment path, including `vllm serve` flags and a systemd wrapper, both sketched after this list
- The comparison is still hardware- and format-dependent: vLLM is serving FP8 while the llama.cpp side runs a GGUF Q8_K_XL model, so this is best read as a practical local-serving result rather than a universal ranking
- The note about tool-call errors at `mtp=3` is a good reminder that higher-throughput knobs can surface correctness issues before they surface benchmark wins
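Why the acceptance jump matters: under the standard speculative-decoding model, with per-token acceptance probability $a$ and draft length $k$, the expected number of tokens committed per verification step is

$$
\mathbb{E}[T] = \frac{1 - a^{k+1}}{1 - a}
$$

Treating the reported percentages as per-token rates (an assumption; MTP acceptance is really position-dependent), $k = 3$ with $a = 0.704$ yields about 2.55 tokens per step, while $a = 0.976$ yields about 3.86, so the acceptance improvement alone is worth roughly 1.5x per step before any other tuning.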
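A hedged sketch of what a throughput-tuned invocation can look like; the model ID is a placeholder and the speculative method string depends on the model family, so none of these flags should be read as the author's verbatim command:

```sh
# Serve an FP8 model tensor-parallel across the four GPUs, with MTP-style
# speculative decoding drafting 3 tokens per step ("mtp=3" in the post).
# The method name "mtp" is an assumption; check your model's vLLM recipe.
vllm serve <fp8-model-id> \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```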
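For the systemd wrapper, a generic unit like the following is the usual shape; the user, paths, and flags here are illustrative, not the author's file:

```ini
# /etc/systemd/system/vllm.service (illustrative placeholder paths)
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target

[Service]
User=vllm
ExecStart=/home/vllm/vllm-env/bin/vllm serve <fp8-model-id> --tensor-parallel-size 4
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now vllm.service` so the server survives reboots.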
// TAGS
vllm · llama-cpp · inference · gpu · benchmark · self-hosted
DISCOVERED
2026-04-29
PUBLISHED
2026-04-29
RELEVANCE
8/10
AUTHOR
see_spot_ruminate