Gemma 4 26B nears 600 tok/s
A Reddit benchmark reports that DFlash speculative decoding in vLLM pushed Gemma 4 26B from about 228 output tok/s to 578 tok/s on a single RTX 5090, with mean end-to-end latency falling from roughly 4.5 seconds to 1.7 seconds. The best serving tradeoff in the post was `num_speculative_tokens=13` with `max_num_batched_tokens=8192`.
This is a strong reminder that speculative decoding can turn a fast open model into something that feels genuinely server-grade on consumer hardware, but the result is still a narrow single-request benchmark, not a universal production guarantee.
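As a rough sketch, a launch in the spirit of the post's best settings might look like the following `vllm serve` invocation. The `dflash` method name, the draft-model-free config shape, and the model ID are assumptions here, not confirmed by the post; check your vLLM version's speculative-decoding docs before copying.

```shell
# Hypothetical launch mirroring the post's best-performing settings.
# The "dflash" method name, the model ID, and the exact JSON keys
# accepted by --speculative-config are assumptions; verify them
# against the documentation for your vLLM build.
vllm serve google/gemma-4-26b \
  --max-num-batched-tokens 8192 \
  --speculative-config '{"method": "dflash", "num_speculative_tokens": 13}'
```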
- The headline gain is real: roughly 2.5x throughput on one GPU is enough to change how practical 26B-class serving feels.
- The serving lesson matters more than the raw peak: `max_num_batched_tokens=4096` had slightly better mean latency, but `8192` cleaned up the tail, which is what users actually feel.
- The benchmark used a random dataset with `concurrency=1` and `request rate=1`, so real chat, code, or tool-use traffic may accept speculative tokens very differently.
- If DFlash holds up on realistic workloads, it becomes a credible way to stretch RTX 5090-class boxes into serious local inference servers.
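To see why a fairly deep draft like `num_speculative_tokens=13` can pay off, a minimal model helps. Assuming each draft token is accepted independently with probability `alpha` (a simplification; real acceptance depends on position and content, and `alpha` below is a made-up illustrative value), the expected tokens emitted per target-model forward pass is a truncated geometric sum:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens.

    Simplified model: each of the k draft tokens is accepted i.i.d. with
    probability alpha; the target model always contributes at least one
    token itself, so the sum runs over 0..k consecutive acceptances.
    Equivalent closed form: (1 - alpha**(k + 1)) / (1 - alpha) for alpha < 1.
    """
    return sum(alpha ** i for i in range(k + 1))


# Illustrative only: with a hypothetical 80% acceptance rate, k=13 yields
# several tokens per target pass, versus exactly 1 without speculation.
baseline = expected_tokens_per_step(0.8, 0)   # no speculation -> 1.0
deep = expected_tokens_per_step(0.8, 13)
```

The marginal gain of each extra draft token shrinks geometrically, which is why the post found returns past `k=13` not worth the added draft compute; on workloads where acceptance drops, the break-even `k` moves lower.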
DISCOVERED
2026-05-08
PUBLISHED
2026-05-08
AUTHOR
chain-77