
OPEN_SOURCE
REDDIT // OPEN-SOURCE RELEASE
Luce DFlash nears 2x throughput on a single RTX 3090
Luce DFlash ports DFlash speculative decoding into a standalone C++/CUDA GGUF stack on ggml, letting a single 24 GB RTX 3090 serve Qwen3.6-27B. The team reports a 1.98x mean throughput gain over autoregressive decoding across HumanEval, GSM8K, and Math500.
// ANALYSIS
This is a credible consumer-GPU infra win, not just a synthetic benchmark flex: it pairs speculative decoding with memory tricks that make 27B-class models practical on 24 GB cards. The tradeoff is clear, though: this is a tightly scoped CUDA-only runtime with greedy verify and a lot of hardware-specific tuning.
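The decode loop described above can be sketched in a few lines. This is a toy illustration of speculative decoding with greedy verification, not Luce DFlash's actual C++/CUDA implementation; the function names and the `k`/`max_new` parameters are hypothetical stand-ins, and the "models" are plain callables rather than real networks.

```python
def speculative_decode(prompt, draft_step, target_step, k=4, max_new=16):
    """Generate up to max_new tokens with draft-then-verify decoding.

    draft_step(seq)  -> next token from the small draft model (cheap).
    target_step(seq) -> next token from the large target model (greedy).
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft k candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_step(seq + draft))
        # 2. Greedy verify: accept the longest prefix that matches the
        #    target model's own argmax choices; on the first mismatch,
        #    keep the target's token instead and end the round.
        accepted = []
        for t in draft:
            want = target_step(seq + accepted)
            if t == want:
                accepted.append(t)
            else:
                accepted.append(want)  # target's correction ends the round
                break
        else:
            # All k drafts accepted: the verify pass yields one bonus token.
            accepted.append(target_step(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + max_new]
```

The key property is that output equals plain greedy decoding regardless of draft quality; a better-matched draft only raises the acceptance length, and so the speedup.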
- The benchmark numbers are strong enough to matter in practice, especially for local inference on a single 3090 where memory headroom is the real constraint.
- The TQ3_0 KV cache and sliding-window decode are a bigger engineering story than the headline speedup, because they extend usable context without blowing VRAM.
- The stack stays intentionally narrow: no llama.cpp runtime, no Python in the engine, no multi-GPU, and no alternative backends such as ROCm or Metal.
- The experimental Qwen3.6 support depends on a matched draft model that is still being trained, so the reported acceptance length (AL) should improve as that draft matures.
- For developers building local serving paths, this is more interesting as a reference architecture for hardware-specific inference tuning than as a general-purpose server.
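The sliding-window decode mentioned in the bullets bounds KV-cache memory by evicting the oldest positions. A minimal sketch, assuming a fixed attention window; the class name is hypothetical, and it stores unquantized entries (TQ3_0 quantization, which Luce DFlash layers on top, is an orthogonal trick not shown here):

```python
from collections import deque

class SlidingKVCache:
    """Keep keys/values for only the last `window` positions, so cache
    memory is O(window) instead of O(full context length)."""

    def __init__(self, window):
        self.window = window
        self.keys = deque(maxlen=window)    # oldest entries auto-evicted
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # Append the current position's key/value; deque drops the
        # oldest entry once the window is full.
        self.keys.append(k)
        self.values.append(v)

    def view(self):
        # Attention at the current step sees only the retained window.
        return list(self.keys), list(self.values)
```

The point of the design is the fixed memory ceiling: context can keep growing, but the cache footprint (and per-step attention cost) stays proportional to the window size.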
// TAGS
luce-dflash · open-source · inference · gpu · llm · cuda
DISCOVERED
4h ago
2026-04-27
PUBLISHED
7h ago
2026-04-27
RELEVANCE
9/10
AUTHOR
sandropuppo