OPEN_SOURCE
REDDIT // BENCHMARK RESULT · 14d ago
Qwen3 speculative decoding tops 280 tok/s on 3090
An HVAC-business benchmark on an RTX 3090 compared 16 GGUF models across Qwen2.5, Qwen3, and Qwen3.5 families, with Qwen3-8B plus a 1.7B draft hitting 279.9 tok/s at 100% acceptance. The bigger lesson is that serving-stack hygiene and deterministic business logic matter more than raw model size once hidden thinking tokens enter the picture.
// ANALYSIS
This benchmark makes a blunt point: local LLM success is mostly a systems problem now. Once the GPU is saturated, the winners are the stacks that pick the right draft model, tame chat templates, and keep formulas out of the prompt.
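Why 100% acceptance matters so much can be sketched with the standard speculative-decoding expectation (this formula is from the general literature, not from the post): with per-token acceptance probability `a` and `k` draft tokens per verification step, the target model produces an expected `(1 - a^(k+1)) / (1 - a)` tokens per forward pass.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    accept_rate: probability each draft token is accepted (0..1)
    draft_len:   number of draft tokens proposed per verification step
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft token accepted: k drafts + 1 bonus token per step.
        return draft_len + 1
    # Geometric-series expectation: (1 - a^(k+1)) / (1 - a).
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

# At 100% acceptance with 4 draft tokens, each target pass yields 5 tokens;
# at 0% acceptance the draft is pure overhead and you get 1 token per pass.
print(expected_tokens_per_step(1.0, 4))  # → 5
print(expected_tokens_per_step(0.0, 4))  # → 1.0
```

This is why the benchmark's `Qwen3-8B + 1.7B` pairing behaves like a near-free multiplier: at perfect acceptance the draft model's cost is the only tax, while at low acceptance rates the extra verification work eats the gain.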
- The `Qwen3-8B + 1.7B` combo is the real winner because 100% acceptance turns speculative decoding into a near-free speed multiplier rather than a fiddly optimization.
- Qwen3.5's thinking mode is a benchmark landmine; if the serving layer doesn't cleanly disable it with `enable_thinking=false`, you're measuring a different workload.
- The math failure is the most actionable result: every model missed the `4,811 / (1 - 0.47)` quote calculation, so pricing and margin math should stay in code.
- The `35B-A3B`'s HVAC knowledge is real but bounded; it handled domain reasoning better than the smaller models, but the `32B` still mis-sized a garage, so scale alone isn't a substitute for judgment.
- Cross-generation draft/target pairings are useful fallback options, but the lower acceptance rates keep same-family matches as the default sweet spot.
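The quote calculation every model missed is a one-liner once it lives in code, which is the post's point. A minimal sketch (the 47% margin and $4,811 cost are the figures from the post; the function name is ours):

```python
def quote_price(cost: float, target_margin: float) -> float:
    """Price a job so that (price - cost) / price == target_margin."""
    if not 0.0 <= target_margin < 1.0:
        raise ValueError("target_margin must be in [0, 1)")
    return cost / (1 - target_margin)

# The benchmark's failed question: quote a $4,811 job at a 47% margin.
print(round(quote_price(4811, 0.47), 2))  # → 9077.36
```

Letting the model decide *which* formula applies while deterministic code does the arithmetic is the split the benchmark argues for.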
// TAGS
qwen3 · llm · inference · gpu · benchmark · self-hosted · open-weights
DISCOVERED
14d ago
2026-03-29
PUBLISHED
14d ago
2026-03-28
RELEVANCE
8 / 10
AUTHOR
Alert_Cockroach_561