OPEN_SOURCE
REDDIT · 3h ago · BENCHMARK RESULT
DFlash boosts Qwen3.5 on 8GB RTX
A user benchmark shows llama.cpp’s DFlash speculative decoding speeding up Qwen3.5-35B-A3B on an 8GB RTX 2080 SUPER, lifting generation from about 26.8 tok/s to 35.6-35.8 tok/s. The trick was pairing a tiny DFlash draft model with MoE CPU offload and tuning draft length and offload settings for acceptance rate.
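A rough sketch of what the reported invocation could look like. DFlash support lives in an unmerged llama.cpp PR, so the exact flag spellings and the model filenames below are assumptions for illustration, not taken from the post:

```shell
# Assumed reconstruction of the benchmark setup (paths are placeholders):
#   -ngl 99        : offload as many layers as fit on the 8GB card
#   -ncmoe 34      : keep 34 MoE expert blocks on the CPU (reported sweet spot)
#   -md ...        : tiny DFlash draft model used for speculation
#   --draft-max 6  : draft length; longer drafts reportedly lowered acceptance
./llama-server \
  -m  Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -md dflash-draft.gguf \
  -ngl 99 -ncmoe 34 --draft-max 6
```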
// ANALYSIS
This is strong evidence that speculative decoding can matter even on VRAM-starved consumer GPUs, not just on large server cards. The win is practical, but it is also clearly configuration-sensitive and still tied to a bleeding-edge llama.cpp PR.
- The result depends on careful tuning: `-ncmoe 34` and `--draft-max 6` were the sweet spots, while longer drafts reduced acceptance and hurt throughput.
- The setup shows a useful pattern for oversized MoE models: keep the main model mostly off GPU, then use a small draft model to recover some decode speed.
- The reported acceptance rate was extremely high at about 99.3%, which is why the speedup held up despite the modest 8GB card.
- This reads more like an early benchmark win than a polished feature release, but it is a meaningful data point for local inference users trying to squeeze large Qwen models onto older hardware.
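The interaction between acceptance rate and draft length can be sketched with a toy model (not from the post): if each drafted token is accepted independently with probability `p`, a draft of length `k` commits `(1 - p**(k+1)) / (1 - p)` tokens per verification pass on average, counting the target model's bonus token. This shows why a 99.3% acceptance rate makes `--draft-max 6` pay off, and why longer drafts stop helping once acceptance drops:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens committed per target-model verification pass.

    Toy model: each of the k drafted tokens is accepted independently
    with probability p; the target model always adds one bonus token
    after the last accepted draft token.
    """
    if p >= 1.0:
        return k + 1  # every draft token accepted, plus the bonus token
    return (1 - p ** (k + 1)) / (1 - p)

# Near-perfect acceptance: a length-6 draft yields almost 7 tokens per pass.
print(expected_tokens_per_pass(0.993, 6))   # ~6.85

# Mediocre acceptance: doubling the draft length barely helps, while the
# draft model's cost per pass doubles -- longer drafts hurt throughput.
print(expected_tokens_per_pass(0.6, 6))     # ~2.43
print(expected_tokens_per_pass(0.6, 12))    # ~2.50
```

The takeaway matches the benchmark: tuning draft length is really tuning where the acceptance curve flattens out.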
// TAGS
llm · inference · gpu · benchmark · open-source · dflash · llama-cpp
DISCOVERED
3h ago
2026-05-01
PUBLISHED
3h ago
2026-05-01
RELEVANCE
8/10
AUTHOR
jwestra