REDDIT // 4h ago · OPEN-SOURCE RELEASE

Luce DFlash tops 2x on RTX 3090

Luce DFlash ports DFlash speculative decoding into a standalone C++/CUDA GGUF stack on ggml, letting a single 24 GB RTX 3090 serve Qwen3.6-27B. The team reports a 1.98x mean throughput gain over autoregressive decoding across HumanEval, GSM8K, and Math500.
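A rough way to sanity-check the ~2x figure: in speculative decoding, throughput scales with the mean number of tokens accepted per target-model verify pass, discounted by the cost of running the draft. The model below is a back-of-envelope sketch only; the draft size, draft cost ratio, and acceptance length are hypothetical values, not numbers from the release.

```python
def estimated_speedup(accept_len, draft_tokens, draft_cost_ratio):
    """Rough speedup of speculative decoding over autoregressive decoding.

    accept_len       -- mean tokens emitted per verify step (incl. bonus token)
    draft_tokens     -- tokens the draft model proposes per step
    draft_cost_ratio -- draft forward cost relative to one target forward pass
    """
    target_passes = 1.0                        # one target verify pass per step
    draft_passes = draft_cost_ratio * draft_tokens
    return accept_len / (target_passes + draft_passes)

# Hypothetical illustration: accepting ~2.4 tokens/step with a cheap
# 4-token draft lands in the neighborhood of the reported ~2x gain.
print(round(estimated_speedup(2.4, 4, 0.05), 2))  # → 2.0
```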

// ANALYSIS

This is a credible consumer-GPU infra win, not just a synthetic benchmark flex: it pairs speculative decoding with memory tricks that make 27B-class models practical on 24 GB cards. The tradeoff is clear, though: this is a tightly scoped CUDA-only runtime with greedy verify and a lot of hardware-specific tuning.
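For readers unfamiliar with the greedy-verify scheme mentioned above, here is a minimal toy sketch of the draft-then-verify loop. The function names and toy integer "models" are invented for illustration; the real runtime runs GGUF models on CUDA, not Python lambdas.

```python
def speculative_step(ctx, draft_next, target_next, k=4):
    """Propose k draft tokens, keep the longest prefix the target's greedy
    decode agrees with, then append one corrective token from the target."""
    proposal = []
    draft_ctx = list(ctx)
    for _ in range(k):                        # cheap draft pass
        t = draft_next(draft_ctx)
        proposal.append(t)
        draft_ctx.append(t)

    accepted = []
    verify_ctx = list(ctx)
    for t in proposal:                        # greedy verification
        if target_next(verify_ctx) != t:
            break                             # first disagreement stops acceptance
        accepted.append(t)
        verify_ctx.append(t)

    accepted.append(target_next(verify_ctx))  # bonus/corrective token
    return accepted

# Toy models over integer "tokens": the draft always guesses n+1;
# the target agrees until it sees 3, then emits 0.
draft = lambda c: c[-1] + 1
target = lambda c: c[-1] + 1 if c[-1] < 3 else 0
print(speculative_step([1], draft, target))   # → [2, 3, 0]
```

Note that every emitted token is one the target model would have produced greedily, which is why greedy verification preserves output quality while cutting target forward passes.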

  • The benchmark numbers are strong enough to matter in practice, especially for local inference on a single 3090 where memory headroom is the real constraint.
  • The TQ3_0 KV cache and sliding-window decode are a bigger engineering story than the headline speedup, because they extend usable context without blowing VRAM.
  • The stack stays intentionally narrow: no llama.cpp runtime, no Python in the engine, no multi-GPU, and no alternative backends like ROCm or Metal.
  • The experimental Qwen3.6 support depends on a matched draft model that is still being trained, so the reported acceptance length (AL) should improve if that draft gets better.
  • For developers building local serving paths, this is more interesting as a reference architecture for hardware-specific inference tuning than as a general-purpose server.
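The sliding-window idea in the bullets above can be sketched in a few lines: keep only the most recent positions of the KV cache so memory stays bounded as the sequence grows. This is a hypothetical illustration; the release's actual window size and TQ3_0 quantized layout are not reproduced here.

```python
from collections import deque

class SlidingKVCache:
    """Toy sliding-window KV cache: old entries are evicted automatically,
    so memory is O(window) regardless of sequence length."""

    def __init__(self, window):
        self.window = window
        self.keys = deque(maxlen=window)    # deque drops the oldest entry
        self.values = deque(maxlen=window)  # once maxlen is reached

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingKVCache(window=4)
for pos in range(10):                        # decode 10 tokens
    cache.append(f"k{pos}", f"v{pos}")
print(len(cache), list(cache.keys)[0])       # → 4 k6
```

The tradeoff is the usual one for windowed attention: tokens outside the window can no longer be attended to, which is acceptable when the model was trained or tuned for windowed decoding.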
// TAGS
luce-dflash · open-source · inference · gpu · llm · cuda

DISCOVERED

4h ago

2026-04-27

PUBLISHED

7h ago

2026-04-27

RELEVANCE

9/10

AUTHOR

sandropuppo