OPEN_SOURCE ↗
REDDIT // 22h ago // BENCHMARK RESULT
Qwen3.6-27B Throughput Slips on 3090
The author can run Qwen3.6-27B on a 3090 with a large context window, but real agent loops quickly push throughput far below the quoted single-prompt numbers. The post argues that coding-agent performance should be judged by end-to-end task time and quality, not raw TPS.
// ANALYSIS
Hot take: most of the impressive TPS claims are for a narrow, favorable setup, not the workload people actually want for coding agents.
- The gap is real: one-shot generation on short prompts is much faster than repeated, tool-using, long-context agent loops.
- Long context is expensive even before reasoning starts; once you fill the window, prefill and attention costs dominate and the "headline TPS" becomes much less relevant.
- Thinking mode is not free. If you want better coding quality, you usually pay for it in latency and lower token throughput.
- The skepticism about MTP/DFlash-style approaches is reasonable: they can look strong in short demos, but quality and speed both tend to degrade as context and task complexity rise.
- The likely answer is not "skill issue" alone. It is a mismatch between benchmark framing and real harness behavior, plus the limits of current inference tricks.
- The practical metric for this use case is not raw TPS, but task completion time at acceptable quality across multi-turn coding jobs.
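The gap between headline TPS and agent-loop throughput can be sketched with a simple cost model. All numbers below (prefill/decode rates, token counts, turn counts) are illustrative assumptions, not measurements from the post; the model also assumes the worst case of no prefix caching, so each turn re-prefills the full context.

```python
# Sketch: why headline decode TPS overstates agent-loop throughput.
# All rates and token counts are hypothetical, chosen only to illustrate
# the shape of the effect. Assumes no KV/prefix cache reuse (worst case).

def effective_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps, turns):
    """End-to-end output tokens/sec across an agent loop where each turn
    re-prefills the growing context before decoding its reply."""
    total_time = 0.0
    total_output = 0
    context = prompt_tokens
    for _ in range(turns):
        total_time += context / prefill_tps       # prefill the full context
        total_time += output_tokens / decode_tps  # then decode the reply
        total_output += output_tokens
        context += output_tokens + prompt_tokens  # next turn adds reply + tool output
    return total_output / total_time

# One-shot, short prompt: effective rate sits close to the quoted decode speed.
one_shot = effective_tps(500, 400, prefill_tps=800, decode_tps=25, turns=1)
# Multi-turn loop with large tool outputs: effective rate falls well below it.
agent_loop = effective_tps(8000, 400, prefill_tps=800, decode_tps=25, turns=8)
print(f"one-shot: {one_shot:.1f} tok/s, agent loop: {agent_loop:.1f} tok/s")
```

Under these assumed numbers the one-shot case lands near the 25 tok/s decode rate, while the eight-turn loop drops to single digits, because prefill over a growing context comes to dominate total time. Prefix caching narrows the gap but does not remove it, since each turn still appends fresh tool output that must be prefetched into the KV cache.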
// TAGS
qwen · qwen3-6-27b · local-first · llama.cpp · vllm · long-context · coding-agent · inference-speed · 3090
DISCOVERED
2026-05-02
PUBLISHED
2026-05-02
RELEVANCE
9/10
AUTHOR
Anbeeld