Qwen3 speculative decoding tops 280 tok/s on 3090
OPEN_SOURCE · REDDIT · 14d ago · BENCHMARK RESULT

An HVAC-business benchmark on an RTX 3090 compared 16 GGUF models across Qwen2.5, Qwen3, and Qwen3.5 families, with Qwen3-8B plus a 1.7B draft hitting 279.9 tok/s at 100% acceptance. The bigger lesson is that serving-stack hygiene and deterministic business logic matter more than raw model size once hidden thinking tokens enter the picture.

// ANALYSIS

This benchmark makes a blunt point: local LLM success is mostly a systems problem now. Once the GPU is saturated, the winners are the stacks that pick the right draft model, tame chat templates, and keep formulas out of the prompt.
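As a rough intuition for why the acceptance rate dominates the outcome, here is a toy throughput model for speculative decoding (the standard analysis with k draft tokens per verification pass and an i.i.d. per-token acceptance probability p; this is a sketch, not any particular serving stack's scheduler):

```python
# Toy model of speculative decoding throughput. With k draft tokens per
# target-model verification pass and per-token acceptance probability p,
# the expected number of tokens committed per pass is
# (1 - p**(k + 1)) / (1 - p), counting the target's bonus token.
def expected_tokens_per_pass(k: int, p: float) -> float:
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if p == 1.0:
        return k + 1.0  # all k drafts accepted, plus the bonus token
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.6, 0.8, 1.0):
    print(f"p={p}: {expected_tokens_per_pass(4, p):.2f} tokens/pass")
```

At 100% acceptance every verification pass commits k+1 tokens, which is why the 8B + 1.7B pairing behaves like a near-free multiplier, while lower-acceptance cross-family pairs decay back toward one token per pass.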

  • The `Qwen3-8B + 1.7B` combo is the real winner because 100% acceptance turns speculative decoding into a near-free speed multiplier rather than a fiddly optimization.
  • Qwen3.5's thinking mode is a benchmark landmine; if the serving layer doesn't cleanly disable it with `enable_thinking=false`, you're measuring a different workload.
  • The math failure is the most actionable result: every model missed the `4,811 / (1 - 0.47)` quote calculation, so pricing and margin math should stay in code.
  • The `35B-A3B`'s HVAC knowledge is real but bounded; it handled domain reasoning better than the smaller models, but the `32B` still mis-sized a garage, so scale alone isn't a substitute for judgment.
  • Cross-generation draft/target pairings are useful fallback options, but the lower acceptance rates keep same-family matches as the default sweet spot.
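The pricing point generalizes: a margin-based quote is one line of deterministic code, so there is no reason to make the model compute it. A minimal sketch, using the $4,811 cost and 47% margin from the failed test case (the function name is illustrative, not from the post):

```python
def quote_price(cost: float, margin: float) -> float:
    """Price at which `margin` fraction of the price is profit:
    price = cost / (1 - margin)."""
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must be in [0, 1)")
    return cost / (1.0 - margin)

# The calculation every benchmarked model got wrong:
print(f"${quote_price(4811, 0.47):,.2f}")  # $9,077.36
```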
// TAGS
qwen3 · llm · inference · gpu · benchmark · self-hosted · open-weights

DISCOVERED

2026-03-29

PUBLISHED

2026-03-28

RELEVANCE

8/10

AUTHOR

Alert_Cockroach_561