Gemma 4 26B nears 600 tok/s
A Reddit benchmark reports that DFlash speculative decoding in vLLM pushed Gemma 4 26B from about 228 output tok/s to 578 tok/s on a single RTX 5090, with mean end-to-end latency falling from roughly 4.5 seconds to 1.7 seconds. The best serving tradeoff in the post was `num_speculative_tokens=13` with `max_num_batched_tokens=8192`.
This is a strong reminder that speculative decoding can turn a fast open model into something that feels genuinely server-grade on consumer hardware, but the result is still a narrow single-request benchmark, not a universal production guarantee.
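As a rough sketch, a launch in the spirit of the post's best settings might look like the following `vllm serve` invocation. The `dflash` method name, the draft-model-free config shape, and the model ID are assumptions here, not confirmed by the post; check your vLLM version's speculative-decoding docs before copying.

```shell
# Hypothetical launch mirroring the post's best-performing settings.
# The "dflash" method name, the model ID, and the exact JSON keys
# accepted by --speculative-config are assumptions; verify them
# against the documentation for your vLLM build.
vllm serve google/gemma-4-26b \
  --max-num-batched-tokens 8192 \
  --speculative-config '{"method": "dflash", "num_speculative_tokens": 13}'
```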
- The headline gain is real: roughly 2.5x throughput on one GPU is enough to change how practical 26B-class serving feels.
- The serving lesson matters more than the raw peak: `max_num_batched_tokens=4096` had slightly better mean latency, but `8192` cleaned up the tail, which is what users actually feel.
- The benchmark used a random dataset with `concurrency=1` and `request rate=1`, so real chat, code, or tool-use traffic may accept speculative tokens very differently.
- If DFlash holds up on realistic workloads, it becomes a credible way to stretch RTX 5090-class boxes into serious local inference servers.
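To see why a fairly deep draft like `num_speculative_tokens=13` can pay off, a minimal model helps. Assuming each draft token is accepted independently with probability `alpha` (a simplification; real acceptance depends on position and content, and `alpha` below is a made-up illustrative value), the expected tokens emitted per target-model forward pass is a truncated geometric sum:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens.

    Simplified model: each of the k draft tokens is accepted i.i.d. with
    probability alpha; the target model always contributes at least one
    token itself, so the sum runs over 0..k consecutive acceptances.
    Equivalent closed form: (1 - alpha**(k + 1)) / (1 - alpha) for alpha < 1.
    """
    return sum(alpha ** i for i in range(k + 1))


# Illustrative only: with a hypothetical 80% acceptance rate, k=13 yields
# several tokens per target pass, versus exactly 1 without speculation.
baseline = expected_tokens_per_step(0.8, 0)   # no speculation -> 1.0
deep = expected_tokens_per_step(0.8, 13)
```

The marginal gain of each extra draft token shrinks geometrically, which is why the post found returns past `k=13` not worth the added draft compute; on workloads where acceptance drops, the break-even `k` moves lower.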
DISCOVERED
2026-05-08
PUBLISHED
2026-05-08
AUTHOR
chain-77