OPEN_SOURCE
REDDIT // 4d ago // BENCHMARK RESULT
DFlash pushes Qwen3.5-27B to 65 tok/s
A Reddit demo shows Qwen3.5-27B running at roughly 65 tokens per second on a 2x RTX 3090 setup using DFlash speculative decoding in vLLM. The post is mainly a performance report, highlighting that a dense 27B model can become much more practical for local inference when paired with an optimized draft model and multi-GPU serving.
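The core idea behind a setup like this can be illustrated with a toy draft-and-verify loop. This is a generic sketch of speculative decoding, not DFlash's actual algorithm; both "models" here are stand-in functions over integer tokens, and the point is only that the expensive target model checks k proposed tokens in a single pass instead of generating them one at a time:

```python
# Toy sketch of draft-and-verify speculative decoding (illustrative only,
# not the DFlash algorithm). A cheap draft model proposes k tokens; the
# expensive target model verifies them in one pass and keeps the longest
# agreeing prefix, correcting the first mismatch.

def draft_model(prefix, k):
    # Hypothetical cheap proposer: emits the next k tokens of a (n+1) % 5 pattern.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 5
        out.append(last)
    return out

def target_model(prefix, proposed):
    # Hypothetical ground truth: the "correct" next token is (last + 1) % 5.
    # Accept draft tokens until they diverge, then emit the correction and stop.
    accepted, last = [], prefix[-1]
    for tok in proposed:
        expected = (last + 1) % 5
        if tok != expected:
            accepted.append(expected)  # target's correction, then stop
            break
        accepted.append(tok)
        last = tok
    return accepted

def speculative_step(prefix, k=4):
    # One decode step: up to k tokens accepted per target-model pass.
    proposed = draft_model(prefix, k)
    return prefix + target_model(prefix, proposed)

print(speculative_step([0]))  # → [0, 1, 2, 3, 4]
```

When the draft model agrees with the target, each verification pass yields several tokens at once, which is where the throughput gain over plain autoregressive decoding comes from.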
// ANALYSIS
DFlash looks like a real throughput unlock for local LLM inference, but this is fundamentally a benchmark result rather than a consumer product launch.
- The claim is concrete: about 65 tok/s on dual 3090s, which is strong for a dense 27B model.
- The setup is doing the heavy lifting: AWQ 4-bit target model, DFlash draft model, vLLM, tensor parallelism, and flash attention.
- The main value here is practical latency reduction for local power users, not a new model capability.
- Because this is a Reddit benchmark post, the result should be treated as an anecdotal performance snapshot, not a broad compatibility guarantee.
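For readers who want to approximate the stack described above, a launch command would look roughly like the following. This is a hedged sketch, not the poster's actual invocation: the model path is a placeholder, the draft-model path is hypothetical, and speculative-decoding flag names have changed across vLLM releases, so check `vllm serve --help` for your installed version.

```shell
# Illustrative vLLM launch for the stack in the post: AWQ 4-bit target model,
# tensor parallelism across two GPUs, and a separate draft model for
# speculative decoding. Paths and the speculative-config schema are
# placeholders; verify flags against your vLLM version's docs.
vllm serve path/to/qwen3.5-27b-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --speculative-config '{"model": "path/to/dflash-draft", "num_speculative_tokens": 4}'
```

Flash attention is typically the default attention backend in recent vLLM builds, so it usually needs no explicit flag.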
// TAGS
qwen · dflash · speculative-decoding · vllm · local-llm · 3090 · inference-optimization · llama-local
DISCOVERED
4d ago
2026-04-07
PUBLISHED
5d ago
2026-04-07
RELEVANCE
8/10
AUTHOR
Kryesh