fdash triples vLLM inference speed on Qwen 3.5

OPEN_SOURCE · REDDIT · BENCHMARK RESULT · 7h ago

A developer benchmarked vLLM's speculative decoding methods on Qwen3.5-27B, finding the new fdash proposer nearly triples generation speed to 125 tokens per second. However, fdash currently lacks compatibility with 8-bit KV cache compression, demanding significantly more VRAM than native MTP alternatives.
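A quick sanity check on the headline claim, using the throughput figures reported in the post (the numbers come from the benchmark; the script itself is just arithmetic):

```python
# Speedup check from the reported benchmark figures.
baseline_tps = 46.57   # no speculative decoding
fdash_tps = 124.96     # fdash proposer
mtp_tps = 84.57        # Qwen 3.5 native MTP

print(f"fdash speedup: {fdash_tps / baseline_tps:.2f}x")  # ~2.68x, "nearly triple"
print(f"MTP speedup:   {mtp_tps / baseline_tps:.2f}x")    # ~1.82x
```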

// ANALYSIS

The speed gains from fdash are staggering, but its heavy memory tax keeps it out of reach for smaller GPU setups.

  • fdash achieved 124.96 TPS against a no-speculation baseline of 46.57 TPS, establishing it as one of the fastest decoding options for local inference
  • Qwen 3.5's native Multi-Token Prediction (MTP) is slower (84.57 TPS) but supports FP8 KV caching, making it the practical choice for VRAM-constrained environments
  • The lack of FP8 KV cache support for fdash forces users to choose between maximum throughput and memory efficiency until vLLM expands compatibility
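The trade-off above comes down to two launch configurations. The sketch below shows how the choice might look with vLLM's offline API; the `"fdash"` and `"mtp"` method strings and the model id are assumptions taken from the post, not verified against a vLLM release, and `num_speculative_tokens` is illustrative:

```python
# Hedged sketch of the two configurations discussed, assuming vLLM's
# speculative_config interface. Method names and model id are assumptions.
from vllm import LLM

# Maximum throughput: fdash proposer. FP8 KV cache is unsupported here,
# so the cache stays in 16-bit and VRAM usage is correspondingly higher.
llm_fast = LLM(
    model="Qwen/Qwen3.5-27B",            # assumed model id
    speculative_config={
        "method": "fdash",               # assumed method name
        "num_speculative_tokens": 3,     # illustrative value
    },
)

# VRAM-constrained alternative: native MTP plus 8-bit KV cache compression.
llm_small = LLM(
    model="Qwen/Qwen3.5-27B",
    speculative_config={"method": "mtp"},  # assumed method name
    kv_cache_dtype="fp8",                  # FP8 KV cache
)
```

Which one fits depends on headroom: the fdash setup buys roughly 48% more throughput than MTP at the cost of the larger 16-bit KV cache.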
// TAGS
vllm · qwen-3.5 · llm · inference · gpu · benchmark

DISCOVERED

7h ago

2026-04-12

PUBLISHED

11h ago

2026-04-12

RELEVANCE

8 / 10

AUTHOR

Sticking_to_Decaf