BeeLlama v0.2.0 boosts DFlash on 3090

// 45d ago · BENCHMARK RESULT

BeeLlama v0.2.0 is a substantial local-LLM runtime update centered on DFlash performance, safer execution, and broader model support. The release adds full Gemma 4 31B support with vision, improves Qwen 3.6 27B throughput by cutting DFlash overhead and tightening prefill/KV handling, and supports upstream-architecture DFlash GGUFs. The benchmark table is the main story: on a single RTX 3090, DFlash reaches 163.9 tok/s on Qwen 3.6 27B and 177.8 tok/s on Gemma 4 31B, while prompt processing stays near baseline. It reads like a targeted step toward making speculative decoding practical rather than merely faster in synthetic cases.

// ANALYSIS

Strong release if you care about squeezing real throughput out of a single consumer GPU without giving up prompt-time performance.

– The headline numbers are credible in context because prompt processing stays roughly at baseline, which suggests the gains are concentrated in generation rather than masking prefill regressions.
– Gemma 4 support plus vision makes this more than a micro-optimization release; it broadens the fork’s useful model surface area.
– The stricter verifier fallback, draft/target validation, and safer CUDA path are the kind of changes that matter for day-to-day stability, not just benchmarks.
– The acceptance rates show the tradeoff clearly: DFlash is much faster, but draft acceptance is uneven, so the practical win depends on prompt shape and model choice.
– For local LLM power users on a 3090, this looks like one of the more meaningful incremental releases in the llama.cpp ecosystem this cycle.
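To make the acceptance-rate tradeoff concrete, here is a minimal sketch of the standard back-of-envelope model for speculative decoding. It is not taken from the BeeLlama release or from DFlash internals. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per verification step, and that drafting adds a fixed overhead expressed as a fraction of one target-model pass. The function names, the `draft_overhead` parameter, and the sample values are all hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes (simplification) that each of the k drafted tokens is accepted
    independently with probability alpha, and that the target model always
    contributes one token of its own after the accepted prefix.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the target's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_overhead: float) -> float:
    """Rough generation speedup over plain autoregressive decoding.

    draft_overhead is a hypothetical parameter: the total drafting cost per
    step, relative to one target-model forward pass.
    """
    if draft_overhead < 0.0:
        raise ValueError("draft_overhead must be non-negative")
    return expected_tokens_per_step(alpha, k) / (1.0 + draft_overhead)


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from the release.
    for alpha in (0.4, 0.6, 0.8):
        speedup = estimated_speedup(alpha, k=8, draft_overhead=0.2)
        print(f"alpha={alpha:.1f} -> estimated speedup {speedup:.2f}x")
```

Under this model the speedup falls off quickly as acceptance drops, which is why uneven acceptance across prompts and models translates into uneven real-world gains.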

// TAGS

local-first, llama.cpp, dflash, speculative-decoding, quantization, qwen, gemma, cuda, benchmarking, inference-optimization

DISCOVERED: 2026-05-23 (45d ago)

PUBLISHED: 2026-05-22 (45d ago)

RELEVANCE: 9/10

AUTHOR: Anbeeld
