REDDIT // BENCHMARK RESULT

vLLM Beats llama.cpp on Quad 5060 Ti

On a quad RTX 5060 Ti rig, vLLM posts strong local-serving numbers: about 1,444.9 tokens/s for prompt processing and 47.4 tokens/s for generation. The setup also shows speculative decoding improving draft acceptance dramatically, and the author includes a practical local deployment path with `uv`, nightly wheels, and `systemd`.
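A quick way to sanity-check generation throughput against a server deployed this way is to time a non-streaming completion and divide by the token count the server reports. A minimal sketch, assuming a vLLM OpenAI-compatible endpoint on `localhost:8000` (the default for `vllm serve`); the base URL and model id below are placeholders, not values from the post:

```python
import time

import requests

# Assumed local endpoint; vLLM's OpenAI-compatible server listens on
# port 8000 by default when started with `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL = "your-model-id"  # placeholder; check GET /v1/models for the real id

def measure_generation_tps(prompt: str, max_tokens: int = 512) -> float:
    """Time one non-streaming completion and return generated tokens/s,
    using the token count the server itself reports in `usage`."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

if __name__ == "__main__":
    tps = measure_generation_tps("Explain KV caching in one paragraph.")
    print(f"{tps:.1f} tok/s")
```

A single request like this measures one stream; batched or concurrent clients would be needed to approach the aggregate serving numbers quoted above.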

// ANALYSIS

This is a solid real-world infra benchmark, not just a synthetic flex. The big takeaway is that vLLM’s serving stack can materially outperform llama.cpp on the same class of hardware when the workload is tuned for throughput.

  • Prompt throughput lands around 1.3x faster than the llama.cpp run cited here, while generation throughput is about 4.12x faster
  • The draft acceptance jump from 70.4% to 97.6% suggests the speculative decoding config is doing real work, not just adding complexity (a rough throughput model for this is sketched after this list)
  • The post is useful because it documents a reproducible deployment path, including the `vllm serve` flags and a `systemd` wrapper
  • The comparison is still hardware- and format-dependent: vLLM is serving FP8, while the llama.cpp side uses a GGUF Q8_K_XL model, so this is best read as a practical local-serving result rather than a universal ranking
  • The note about tool-call errors at `mtp=3` is a good reminder that throughput-oriented knobs can surface correctness issues before they surface benchmark wins
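
Those acceptance numbers translate into a concrete bound on tokens emitted per verification step. A back-of-the-envelope sketch using the standard speculative-decoding expectation from Leviathan et al. (2023), assuming i.i.d. per-token acceptance and a draft length of 3 chosen to loosely match the `mtp=3` setting; both the formula and the draft length are assumptions layered on the post's numbers, not something the post states:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step for speculative
    decoding with draft length gamma and per-token acceptance
    probability alpha (Leviathan et al. 2023). Assumes acceptances
    are i.i.d. -- a simplification, not a claim from the post."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Acceptance rates from the post; draft length 3 is an assumption
# loosely matching the mtp=3 setting it mentions.
for alpha in (0.704, 0.976):
    tokens = expected_tokens_per_step(alpha, 3)
    print(f"acceptance {alpha:.1%}: {tokens:.2f} tokens/step")
```

Under these assumptions, moving from 70.4% to 97.6% acceptance raises the expectation from about 2.5 to about 3.9 tokens per verification step, roughly a 1.5x decoding speedup from the acceptance improvement alone.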
// TAGS
vllm · llama-cpp · inference · gpu · benchmark · self-hosted

DISCOVERED

2026-04-29 (3h ago)

PUBLISHED

2026-04-29 (6h ago)

RELEVANCE

8/10

AUTHOR

see_spot_ruminate