OPEN_SOURCE · REDDIT // 3h ago · BENCHMARK RESULT

DFlash boosts Qwen3.5 on 8GB RTX

A user benchmark shows llama.cpp’s DFlash speculative decoding speeding up Qwen3.5-35B-A3B on an 8GB RTX 2080 SUPER, lifting generation from about 26.8 tok/s to 35.6-35.8 tok/s. The trick was pairing a tiny DFlash draft model with MoE CPU offload, then tuning draft length and offload settings to keep the acceptance rate high.
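As a rough sketch of what such a launch looks like (not the poster's exact command; it assumes the DFlash PR reuses llama.cpp's standard speculative-decoding flag `-md`/`--model-draft`, and the GGUF filenames are placeholders):

```sh
# Hedged sketch: filenames are hypothetical, and -md/--model-draft is assumed
# to be how the DFlash PR wires in its draft model.
./llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -md dflash-qwen3.5-draft.gguf \
  -ngl 99 \
  -ncmoe 34 \
  --draft-max 6
# -ngl 99      : offload every layer that fits onto the 8GB card
# -ncmoe 34    : keep the MoE expert weights of the first 34 layers on the CPU
# --draft-max 6: draft at most 6 tokens per verification step
```

The `-ncmoe 34` and `--draft-max 6` values are the sweet-spot settings from the post (see the analysis below); everything else is illustrative.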

// ANALYSIS

This is strong evidence that speculative decoding can matter even on VRAM-starved consumer GPUs, not just on large server cards. The win is practical, but it is also clearly configuration-sensitive and still tied to a bleeding-edge llama.cpp PR.

  • The result depends on careful tuning: `-ncmoe 34` and `--draft-max 6` were the sweet spots, while longer drafts reduced acceptance and hurt throughput.
  • The setup shows a useful pattern for oversized MoE models: keep the main model mostly off GPU, then use a small draft model to recover some decode speed.
  • The reported acceptance rate was extremely high, about 99.3%, which is why the speedup held up despite the modest 8GB card (see the arithmetic after this list).
  • This reads more like an early benchmark win than a polished feature release, but it is a meaningful data point for local inference users trying to squeeze large Qwen models onto older hardware.
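
For intuition on why that acceptance number carries the result, treat the reported 99.3% as a per-token acceptance probability α (an approximation; the post reports an aggregate rate). Under the standard speculative-decoding model, the expected number of tokens committed per expensive target-model pass with draft length k is

$$
\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{k+1}}{1 - \alpha} \approx \frac{1 - 0.993^{7}}{1 - 0.993} \approx 6.9
$$

so at k = 6 the pipeline banks nearly the full 7-token budget (6 drafted plus 1 from verification) per MoE forward pass. At α = 0.8 the same draft length would yield only about 4 tokens, which is why pushing `--draft-max` higher backfires once acceptance slips. The end-to-end gain (~1.33×) is far below 6.9×, presumably because drafting, verification, and the CPU-resident experts still take real wall-clock time.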
// TAGS
llm · inference · gpu · benchmark · open-source · dflash · llama-cpp

DISCOVERED

2026-05-01 (3h ago)

PUBLISHED

2026-05-01 (3h ago)

RELEVANCE

8/10

AUTHOR

jwestra