Qwen3 Coder hits 1,207 tok/s

// 128d ago · BENCHMARK RESULT

CloudRift published a reproducible benchmark and deployment guide showing Qwen3 Coder reaching 1,157 tok/s on an RTX 5090 and 1,207 tok/s peak throughput on an RTX PRO 6000, with vLLM ultimately outperforming SGLang under sustained load. The post also opens DeploDock’s GitHub-based benchmarking infrastructure to community PRs through March, making these sweeps easier to rerun and compare.
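Headline figures such as 1,207 tok/s are usually aggregate output throughput: the total tokens generated across all in-flight requests, divided by wall-clock time at a fixed concurrency level. The sketch below shows how a concurrency sweep can compute that number. It is not CloudRift's or DeploDock's harness. `generate` is a hypothetical stand-in for one request to the serving engine's API that returns the number of output tokens, and the fake backend exists only so the script runs offline.

```python
from __future__ import annotations

import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical request function: sends one prompt to the server and
# returns how many output tokens came back.
GenerateFn = Callable[[int], Awaitable[int]]


async def measure_throughput(generate: GenerateFn, num_requests: int, concurrency: int) -> float:
    """Aggregate output throughput in tokens/second.

    At most `concurrency` requests are in flight at once. Throughput is the
    total output tokens divided by the wall-clock time for the whole batch.
    """
    sem = asyncio.Semaphore(concurrency)

    async def one(i: int) -> int:
        async with sem:  # cap the number of concurrent requests
            return await generate(i)

    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one(i) for i in range(num_requests)))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


async def sweep(generate: GenerateFn, levels: list[int], requests_per_level: int) -> dict[int, float]:
    """Measure throughput at each concurrency level, one level at a time."""
    results: dict[int, float] = {}
    for c in levels:
        results[c] = await measure_throughput(generate, requests_per_level, c)
    return results


if __name__ == "__main__":
    # Hypothetical fake backend: each request "generates" 100 tokens in 20 ms.
    async def fake_generate(i: int) -> int:
        await asyncio.sleep(0.02)
        return 100

    for c, tps in asyncio.run(sweep(fake_generate, [1, 8, 40], 40)).items():
        print(f"concurrency={c:>2}  {tps:,.0f} tok/s")
```

With the fake backend, throughput grows roughly linearly with concurrency because nothing contends for the GPU. Against a real vLLM or SGLang server the curve flattens once batching saturates compute or KV-cache memory. That is why results are reported per concurrency level rather than as a single number.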

// ANALYSIS

The bigger story is not just that Qwen3 Coder runs fast on Blackwell GPUs, but that CloudRift finally packaged the tuning process into shareable recipes instead of one-off forum screenshots.

–vLLM crushed SGLang on the RTX 5090 AWQ setup, delivering 2.7x higher output throughput and a much lower time to first token (TTFT) for Qwen3-Coder-30B-A3B-Instruct-AWQ
–The 32GB RTX 5090 sustained roughly 115K tokens of context without a throughput collapse, while the 96GB PRO 6000 handled the full 262K context window with no measurable degradation
–On the PRO 6000, SGLang won the low-concurrency comparison, but vLLM pulled ahead hard at scale and peaked at 1,207 tok/s once concurrency was pushed to 40
–DeploDock looks like the real infrastructure play here: recipes, benchmark matrices, and GitHub Actions results turn LLM serving optimization into something teams can version, review, and replicate
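The context ceilings above follow largely from KV-cache memory. Every token held in context stores one key vector and one value vector per layer per KV head. The back-of-the-envelope sketch below assumes that Qwen3-Coder-30B-A3B uses 48 layers, 4 KV heads (grouped-query attention), a head dimension of 128, and a 16-bit KV cache. These shape values are assumptions, so check them against the model's `config.json`. The sketch does not model weights, activations, or engine overhead, which all come on top.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens of one sequence.

    The leading factor 2 counts one key tensor plus one value tensor per layer.
    bytes_per_elem=2 models an FP16/BF16 cache; 1 would model an FP8 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumed Qwen3-Coder-30B-A3B attention shape (verify against config.json).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
for ctx in (115_000, 262_144):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>7,} tokens -> {gib:.1f} GiB")
```

Under these assumptions, the script reports 96 KiB per token, about 10.5 GiB at 115,000 tokens, and 24.0 GiB for the full 262,144-token window. The full window fits comfortably on a 96GB card. A 32GB card also has to hold the AWQ weights, which is in line with its lower ceiling. Treat the result as a lower bound per sequence, since paging granularity and other allocations are ignored.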

// TAGS

qwen3-coder · llm · inference · benchmark · open-source · cloud

DISCOVERED

2026-03-06 (128d ago)

PUBLISHED

2026-03-06 (128d ago)

RELEVANCE

8/10

AUTHOR

NoVibeCoding