OPEN_SOURCE ↗
REDDIT // 22h ago // BENCHMARK RESULT
Qwen3.6-27B Throughput Slips on 3090
The author can run Qwen3.6-27B on a 3090 with a large context window, but real agent loops quickly push throughput far below the quoted single-prompt numbers. The post argues that coding-agent performance should be judged by end-to-end task time and quality, not raw TPS.
// ANALYSIS
Hot take: most of the impressive TPS claims are for a narrow, favorable setup, not the workload people actually want for coding agents.
- The gap is real: one-shot generation on short prompts is much faster than repeated, tool-using, long-context agent loops.
- Long context is expensive even before reasoning starts; once you fill the window, prefill and attention costs dominate and the "headline TPS" becomes much less relevant.
- Thinking mode is not free. If you want better coding quality, you usually pay for it in latency and lower token throughput.
- The skepticism about MTP/DFlash-style approaches is reasonable: they can look strong in short demos, but quality and speed both tend to degrade as context and task complexity rise.
- The likely answer is not "skill issue" alone. It is a mismatch between benchmark framing and real harness behavior, plus the limits of current inference tricks.
- The practical metric for this use case is not raw TPS, but task completion time at acceptable quality across multi-turn coding jobs.
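The gap between headline TPS and agent-loop throughput can be sketched with a simple cost model. All numbers below (prefill/decode rates, token counts, turn counts) are illustrative assumptions, not measurements from the post; the model also assumes the worst case of no prefix caching, so each turn re-prefills the full context.

```python
# Sketch: why headline decode TPS overstates agent-loop throughput.
# All rates and token counts are hypothetical, chosen only to illustrate
# the shape of the effect. Assumes no KV/prefix cache reuse (worst case).

def effective_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps, turns):
    """End-to-end output tokens/sec across an agent loop where each turn
    re-prefills the growing context before decoding its reply."""
    total_time = 0.0
    total_output = 0
    context = prompt_tokens
    for _ in range(turns):
        total_time += context / prefill_tps       # prefill the full context
        total_time += output_tokens / decode_tps  # then decode the reply
        total_output += output_tokens
        context += output_tokens + prompt_tokens  # next turn adds reply + tool output
    return total_output / total_time

# One-shot, short prompt: effective rate sits close to the quoted decode speed.
one_shot = effective_tps(500, 400, prefill_tps=800, decode_tps=25, turns=1)
# Multi-turn loop with large tool outputs: effective rate falls well below it.
agent_loop = effective_tps(8000, 400, prefill_tps=800, decode_tps=25, turns=8)
print(f"one-shot: {one_shot:.1f} tok/s, agent loop: {agent_loop:.1f} tok/s")
```

Under these assumed numbers the one-shot case lands near the 25 tok/s decode rate, while the eight-turn loop drops to single digits, because prefill over a growing context comes to dominate total time. Prefix caching narrows the gap but does not remove it, since each turn still appends fresh tool output that must be prefetched into the KV cache.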
// TAGS
qwen · qwen3-6-27b · local-first · llama.cpp · vllm · long-context · coding-agent · inference-speed · 3090
DISCOVERED
2026-05-02
PUBLISHED
2026-05-02
RELEVANCE
9/10
AUTHOR
Anbeeld