DFlash pushes Qwen3.5-27B to 65 tok/s
OPEN_SOURCE
REDDIT · 4d ago · BENCHMARK RESULT

A Reddit demo shows Qwen3.5-27B running at roughly 65 tokens per second on a 2x RTX 3090 setup using DFlash speculative decoding in vLLM. The post is mainly a performance report, highlighting that a dense 27B model can become much more practical for local inference when paired with an optimized draft model and multi-GPU serving.
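The stack described above can be sketched with vLLM's offline Python API. This is a hypothetical reconstruction, not the poster's actual command: the model names are placeholders, the DFlash draft checkpoint path is invented for illustration, and vLLM's speculative-decoding configuration has changed across versions (older releases used separate `speculative_model` arguments rather than a `speculative_config` dict).

```python
# Hypothetical sketch of the setup from the post: AWQ 4-bit target model,
# a DFlash draft model for speculative decoding, and tensor parallelism
# across two GPUs. Model paths are placeholders, not real checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B-AWQ",          # placeholder AWQ 4-bit target
    quantization="awq",
    tensor_parallel_size=2,                 # split across the two 3090s
    speculative_config={
        "model": "path/to/dflash-draft",    # placeholder DFlash draft model
        "num_speculative_tokens": 5,        # illustrative draft length
    },
)

outputs = llm.generate(["Explain speculative decoding in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

Since this requires two GPUs and real checkpoints, treat it as a config sketch to adapt rather than something to run as-is.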

// ANALYSIS

DFlash looks like a real throughput unlock for local LLM inference, but this is fundamentally a benchmark result rather than a consumer product launch.

  • The claim is concrete: about 65 tok/s on dual 3090s, which is strong for a dense 27B model.
  • The setup is doing the heavy lifting: AWQ 4-bit quantization of the target model, a DFlash draft model, vLLM serving, tensor parallelism across the two GPUs, and FlashAttention.
  • The main value here is practical latency reduction for local power users, not a new model capability.
  • Because this is a Reddit benchmark post, the result should be treated as an anecdotal performance snapshot, not a broad compatibility guarantee.
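As a back-of-envelope check on why a draft model lifts throughput this much: in the standard speculative-decoding analysis, if the draft proposes k tokens and each is accepted with probability α, the target model emits an expected (1 − α^(k+1)) / (1 − α) tokens per verification step. A minimal sketch with illustrative numbers (α, k, and the baseline step rate are assumptions, not figures from the post):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens generated per target-model verification step in
    speculative decoding: the draft proposes k tokens, each accepted
    with probability alpha, plus one token sampled by the target itself.
    Ignores the draft model's own compute cost."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Illustrative only: a target that alone runs 20 verification steps/s,
# with an 80% acceptance rate on 5 draft tokens per step.
alpha, k, steps_per_s = 0.8, 5, 20.0
print(steps_per_s * expected_tokens_per_step(alpha, k))  # ≈ 73.8 tok/s
```

The point of the sketch is the shape of the gain, not the exact numbers: high acceptance rates multiply throughput several-fold, which is how a dense 27B model can plausibly reach the reported ~65 tok/s on consumer GPUs.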
// TAGS
qwen · dflash · speculative-decoding · vllm · local-llm · 3090 · inference-optimization · llama-local

DISCOVERED

4d ago · 2026-04-07

PUBLISHED

5d ago · 2026-04-07

RELEVANCE

8/10

AUTHOR

Kryesh