OPEN_SOURCE
REDDIT · 37d ago · BENCHMARK RESULT
Qwen3 Coder hits 1,207 tok/s
CloudRift published a reproducible benchmark and deployment guide showing Qwen3 Coder reaching 1,157 tok/s on an RTX 5090 and 1,207 tok/s peak throughput on an RTX PRO 6000, with vLLM ultimately outperforming SGLang under sustained load. The post also opens DeploDock’s GitHub-based benchmarking infrastructure to community PRs through March, making these sweeps easier to rerun and compare.
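The deployment side of the guide boils down to standing up a vLLM server for the AWQ model. As a hedged sketch (the Hugging Face repo path and flag values are placeholders, not the exact ones CloudRift used), the shape of the command is:

```shell
# Minimal vLLM serving sketch for the AWQ build benchmarked in the post.
# Model path and limits are illustrative; the 5090 run held ~115K context,
# so --max-model-len would be tuned to fit the card's 32GB of VRAM.
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --max-model-len 115000 \
  --gpu-memory-utilization 0.90
```

On the 96GB PRO 6000, the same command could raise `--max-model-len` to the model's full 262K window.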
// ANALYSIS
The bigger story is not just that Qwen3 Coder runs fast on Blackwell GPUs, but that someone finally packaged the tuning process into shareable recipes instead of one-off forum screenshots.
- vLLM crushed SGLang on the 5090 AWQ setup, delivering 2.7x higher output throughput and much better TTFT for Qwen3-Coder-30B-A3B-Instruct-AWQ
- The 32GB RTX 5090 held roughly 115K context without throughput collapse, while the 96GB PRO 6000 handled the full 262K context window with no measurable degradation
- On the PRO 6000, SGLang won the low-concurrency comparison, but vLLM pulled ahead hard at scale and peaked at 1,207 tok/s once concurrency was pushed to 40
- DeploDock looks like the real infrastructure play here: recipes, benchmark matrices, and GitHub Actions results turn LLM serving optimization into something teams can version, review, and replicate
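The two headline metrics in sweeps like this, aggregate output throughput (tok/s) and time-to-first-token, are simple functions of per-request timings. A minimal sketch of how a harness typically derives them (hypothetical timing data, not CloudRift's actual measurements or code):

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    start: float          # request sent (seconds)
    first_token: float    # first output token received (seconds)
    end: float            # last output token received (seconds)
    output_tokens: int    # tokens generated for this request

def output_throughput(traces: list[RequestTrace]) -> float:
    """Aggregate output tokens per second over the whole run's wall time."""
    total_tokens = sum(t.output_tokens for t in traces)
    wall = max(t.end for t in traces) - min(t.start for t in traces)
    return total_tokens / wall

def mean_ttft(traces: list[RequestTrace]) -> float:
    """Mean time-to-first-token across all requests, in seconds."""
    return sum(t.first_token - t.start for t in traces) / len(traces)

# Illustrative: two concurrent requests over a 10-second window
traces = [
    RequestTrace(start=0.0, first_token=0.2, end=10.0, output_tokens=6000),
    RequestTrace(start=0.0, first_token=0.3, end=10.0, output_tokens=6070),
]
print(round(output_throughput(traces)))   # → 1207 (aggregate tok/s)
print(round(mean_ttft(traces), 2))        # → 0.25 (seconds)
```

This is why concurrency matters so much in the results above: aggregate tok/s sums tokens across all in-flight requests, so an engine that batches well (vLLM at concurrency 40) can post a much higher number than one that is faster per single request.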
// TAGS
qwen3-coder · llm · inference · benchmark · open-source · cloud
DISCOVERED
2026-03-06 (37d ago)
PUBLISHED
2026-03-06 (37d ago)
RELEVANCE
8/10
AUTHOR
NoVibeCoding