Lucebox DFlash hits 207 tok/s on RTX 3090
Lucebox Hub is an open-source LLM inference optimization repo focused on hand-tuned performance for specific hardware. The highlighted benchmark is its DFlash + DDTree port for Qwen3.5-27B GGUF on an RTX 3090, where the project reports a peak of 207.6 tok/s in its demo and 129.5 tok/s on its HumanEval benchmark. The repo also includes a separate Megakernel release for Qwen3.5-0.8B, with writeups, benchmark tables, and reproducible build instructions.
This is less a product launch than a performance flex with real engineering substance. The interesting part is not just the raw tok/s number, but that the project fits speculative decoding, tree verification, and a GGUF target model into 24 GB of VRAM on consumer hardware.
- The 207 tok/s claim is tied to DFlash + DDTree on Qwen3.5-27B, not plain autoregressive decoding.
- The repo is unusually transparent: it includes benchmark tables, hardware constraints, and implementation notes.
- The project’s value prop is clear for local AI users: more throughput on existing RTX 3090-class cards without changing hardware.
- The strongest audience fit is developers who care about inference kernels, quantization, and local model serving performance.
DISCOVERED: 2026-04-21
PUBLISHED: 2026-04-20
AUTHOR: GreenGames