BeeLlama v0.2.0 boosts DFlash on 3090
BeeLlama v0.2.0 is a substantial local-LLM runtime update centered on DFlash performance, safer execution, and broader model support. The release adds full Gemma 4 31B support with vision, improves Qwen 3.6 27B throughput by cutting DFlash overhead and tightening prefill/KV handling, and supports upstream-architecture DFlash GGUFs. The benchmark table is the main story: on a single RTX 3090, DFlash reaches 163.9 tok/s on Qwen 3.6 27B and 177.8 tok/s on Gemma 4 31B, while prompt processing stays near baseline. It reads like a targeted step toward making speculative decoding practical rather than merely faster in synthetic cases.
Strong release if you care about squeezing real throughput out of a single consumer GPU without giving up prompt-time performance.
- The headline numbers are credible in context because prompt processing stays roughly baseline, which suggests the gains are concentrated in generation rather than masking prefill regressions.
- Gemma 4 support plus vision makes this more than a micro-optimization release; it broadens the fork’s useful model surface area.
- The stricter verifier fallback, draft/target validation, and safer CUDA path are the kind of changes that matter for day-to-day stability, not just benchmarks.
- The acceptance rates show the tradeoff clearly: DFlash is much faster, but draft acceptance is uneven, so the practical win depends on prompt shape and model choice.
- For local LLM power users on a 3090, this looks like one of the more meaningful incremental releases in the llama.cpp ecosystem this cycle.
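To make the acceptance-rate tradeoff concrete, here is a minimal sketch of the textbook speculative-decoding estimate. It is not BeeLlama's code, and DFlash's actual drafting scheme may behave differently. It assumes each drafted token is accepted independently with the same probability, and that one draft token costs a fixed fraction of a target forward pass. The function names and the parameters `accept_rate`, `draft_len`, and `draft_cost_ratio` are illustrative, not taken from the release.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumption (not from the release notes): every drafted token is accepted
    independently with probability `accept_rate`, and the target always
    contributes one token of its own after the accepted prefix.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft token is accepted, plus the target's own token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Rough generation speedup versus plain autoregressive decoding.

    `draft_cost_ratio` is the hypothetical cost of drafting one token,
    expressed as a fraction of one target forward pass.
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    step_cost = 1.0 + draft_len * draft_cost_ratio
    return tokens / step_cost


if __name__ == "__main__":
    # Sweep a few acceptance rates to see how quickly the benefit shrinks.
    for rate in (0.3, 0.5, 0.7, 0.9):
        print(rate, round(estimated_speedup(rate, draft_len=8,
                                            draft_cost_ratio=0.05), 2))
```

Under these assumptions the estimate falls off sharply as acceptance drops, which is why uneven acceptance across prompts translates into uneven real-world gains.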
DISCOVERED: 2026-05-23
PUBLISHED: 2026-05-22
AUTHOR: Anbeeld
