Qwen3 Coder hits 1,207 tok/s
CloudRift published a reproducible benchmark and deployment guide showing Qwen3 Coder reaching 1,157 tok/s on an RTX 5090 and 1,207 tok/s peak throughput on an RTX PRO 6000, with vLLM ultimately outperforming SGLang under sustained load. The post also opens DeploDock’s GitHub-based benchmarking infrastructure to community PRs through March, making these sweeps easier to rerun and compare.
The bigger story is not just that Qwen3 Coder runs fast on Blackwell GPUs, but that someone finally packaged the tuning process into shareable recipes instead of one-off forum screenshots.
- vLLM crushed SGLang on the 5090 AWQ setup, delivering 2.7x higher output throughput and much better TTFT for Qwen3-Coder-30B-A3B-Instruct-AWQ
- The 32GB RTX 5090 held roughly 115K context without throughput collapse, while the 96GB PRO 6000 handled the full 262K context window with no measurable degradation
- On PRO 6000, SGLang won the low-concurrency comparison, but vLLM pulled ahead hard at scale and peaked at 1,207 tok/s once concurrency was pushed to 40
- DeploDock looks like the real infrastructure play here: recipes, benchmark matrices, and GitHub Actions results turn LLM serving optimization into something teams can version, review, and replicate
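To put the headline numbers in perspective, here is a minimal sketch of how TTFT, aggregate output throughput, and per-stream rate relate to each other. The `RequestTiming` record, the function names, and the sample timings are hypothetical, and the metric definitions are common conventions assumed here, not necessarily the ones CloudRift's benchmark uses. Only the 1,207 tok/s and concurrency-40 figures come from the post.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record; all times are in seconds."""
    start: float         # when the request was sent
    first_token: float   # when the first output token arrived
    end: float           # when the last output token arrived
    output_tokens: int   # number of generated tokens


def mean_ttft(timings):
    """Average time to first token across requests."""
    return sum(t.first_token - t.start for t in timings) / len(timings)


def output_throughput(timings):
    """Aggregate output tokens per second over the whole run's wall-clock span."""
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return sum(t.output_tokens for t in timings) / wall


def per_stream_rate(aggregate_tok_s, concurrency):
    """Rough tokens per second each concurrent request sees, assuming an even split."""
    return aggregate_tok_s / concurrency


# Made-up timings for two concurrent requests, for illustration only.
sample = [
    RequestTiming(start=0.0, first_token=0.25, end=2.0, output_tokens=100),
    RequestTiming(start=0.0, first_token=0.5, end=2.0, output_tokens=100),
]
print(mean_ttft(sample))
print(output_throughput(sample))

# Figures from the post: 1,207 tok/s peak at concurrency 40.
print(round(per_stream_rate(1207, 40), 1))
```

Under the even-split assumption, the 1,207 tok/s peak works out to roughly 30 tok/s per request, which is why a peak aggregate figure and a low-concurrency comparison can favor different engines.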
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
AUTHOR
NoVibeCoding