fdash triples vLLM inference speed on Qwen 3.5

// 91d ago BENCHMARK RESULT

A developer benchmarked vLLM's speculative decoding methods on Qwen3.5-27B and found that the new fdash proposer raises generation speed from about 47 to 125 tokens per second, roughly 2.7x the baseline without speculation. However, fdash is not yet compatible with 8-bit (FP8) KV cache compression, so it needs significantly more VRAM than the native MTP alternative.
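To see why 8-bit KV cache support matters for VRAM, here is a rough sketch of KV cache sizing for standard attention layers. The layer count, KV head count, head dimension, and context length below are hypothetical placeholders, not Qwen3.5-27B's actual configuration, and the estimate ignores quantization scale overhead and any layers that do not use a conventional KV cache.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size for standard attention.

    The factor of 2 accounts for one key tensor and one value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# HYPOTHETICAL model dimensions, chosen only for illustration.
HYPOTHETICAL_DIMS = dict(num_layers=48, num_kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 32768  # assumed context length

fp16_bytes = kv_cache_bytes(CONTEXT_TOKENS, bytes_per_elem=2, **HYPOTHETICAL_DIMS)
fp8_bytes = kv_cache_bytes(CONTEXT_TOKENS, bytes_per_elem=1, **HYPOTHETICAL_DIMS)

GIB = 1024 ** 3
print(f"16-bit KV cache: {fp16_bytes / GIB:.1f} GiB")
print(f" 8-bit KV cache: {fp8_bytes / GIB:.1f} GiB")
```

Whatever the real dimensions are, halving the bytes per element halves this part of the memory budget, which is the saving a method without FP8 KV cache support gives up.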

// ANALYSIS

The speed gains from fdash are staggering, but its heavy memory tax keeps it out of reach for smaller GPU setups.

– fdash reached 124.96 TPS against a baseline of 46.57 TPS without speculation (about 2.7x), placing it among the top-tier decoding methods for local inference
– Qwen 3.5's native Multi-Token Prediction (MTP) is slower at 84.57 TPS but supports FP8 KV caching, making it the practical choice for VRAM-constrained environments
– Until vLLM adds FP8 KV cache support for fdash, users must choose between maximum throughput and memory efficiency
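The relative gains follow directly from the reported throughput figures. The short sketch below only redoes that arithmetic; the variable and function names are illustrative, and the TPS values are the ones quoted above.

```python
# Throughput figures reported in the benchmark (tokens per second).
BASELINE_TPS = 46.57
results = {"fdash": 124.96, "mtp": 84.57}


def speedup(tps, baseline=BASELINE_TPS):
    """Return throughput relative to the no-speculation baseline."""
    return tps / baseline


speedups = {name: round(speedup(tps), 2) for name, tps in results.items()}
fdash_over_mtp = round(results["fdash"] / results["mtp"], 2)

for name, factor in speedups.items():
    print(f"{name}: {factor}x baseline")
print(f"fdash vs mtp: {fdash_over_mtp}x")
```

On these numbers fdash is about 2.68x the baseline, MTP about 1.82x, and fdash about 1.48x faster than MTP, so "triples" in the headline is a round-up.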

// TAGS

vllm qwen-3.5 llm inference gpu benchmark

DISCOVERED

91d ago

2026-04-12

PUBLISHED

91d ago

2026-04-12

RELEVANCE

8/10

AUTHOR

Sticking_to_Decaf
