OPEN_SOURCE
REDDIT // BENCHMARK RESULT · 14d ago
Qwen3 speculative decoding tops 280 tok/s on 3090
An HVAC-business benchmark on an RTX 3090 compared 16 GGUF models across Qwen2.5, Qwen3, and Qwen3.5 families, with Qwen3-8B plus a 1.7B draft hitting 279.9 tok/s at 100% acceptance. The bigger lesson is that serving-stack hygiene and deterministic business logic matter more than raw model size once hidden thinking tokens enter the picture.
// ANALYSIS
This benchmark makes a blunt point: local LLM success is mostly a systems problem now. Once the GPU is saturated, the winners are the stacks that pick the right draft model, tame chat templates, and keep formulas out of the prompt.
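Why 100% acceptance matters so much can be sketched with the standard speculative-decoding expectation (this formula is from the general literature, not from the post): with per-token acceptance probability `a` and `k` draft tokens per verification step, the target model produces an expected `(1 - a^(k+1)) / (1 - a)` tokens per forward pass.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    accept_rate: probability each draft token is accepted (0..1)
    draft_len:   number of draft tokens proposed per verification step
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft token accepted: k drafts + 1 bonus token per step.
        return draft_len + 1
    # Geometric-series expectation: (1 - a^(k+1)) / (1 - a).
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

# At 100% acceptance with 4 draft tokens, each target pass yields 5 tokens;
# at 0% acceptance the draft is pure overhead and you get 1 token per pass.
print(expected_tokens_per_step(1.0, 4))  # → 5
print(expected_tokens_per_step(0.0, 4))  # → 1.0
```

This is why the benchmark's `Qwen3-8B + 1.7B` pairing behaves like a near-free multiplier: at perfect acceptance the draft model's cost is the only tax, while at low acceptance rates the extra verification work eats the gain.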
- The `Qwen3-8B + 1.7B` combo is the real winner because 100% acceptance turns speculative decoding into a near-free speed multiplier rather than a fiddly optimization.
- Qwen3.5's thinking mode is a benchmark landmine; if the serving layer doesn't cleanly disable it with `enable_thinking=false`, you're measuring a different workload.
- The math failure is the most actionable result: every model missed the `4,811 / (1 - 0.47)` quote calculation, so pricing and margin math should stay in code.
- The `35B-A3B`'s HVAC knowledge is real but bounded; it handled domain reasoning better than the smaller models, but the `32B` still mis-sized a garage, so scale alone isn't a substitute for judgment.
- Cross-generation draft/target pairings are useful fallback options, but the lower acceptance rates keep same-family matches as the default sweet spot.
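The quote calculation every model missed is a one-liner once it lives in code, which is the post's point. A minimal sketch (the 47% margin and $4,811 cost are the figures from the post; the function name is ours):

```python
def quote_price(cost: float, target_margin: float) -> float:
    """Price a job so that (price - cost) / price == target_margin."""
    if not 0.0 <= target_margin < 1.0:
        raise ValueError("target_margin must be in [0, 1)")
    return cost / (1 - target_margin)

# The benchmark's failed question: quote a $4,811 job at a 47% margin.
print(round(quote_price(4811, 0.47), 2))  # → 9077.36
```

Letting the model decide *which* formula applies while deterministic code does the arithmetic is the split the benchmark argues for.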
// TAGS
qwen3 · llm · inference · gpu · benchmark · self-hosted · open-weights
DISCOVERED
14d ago
2026-03-29
PUBLISHED
14d ago
2026-03-28
RELEVANCE
8 / 10
AUTHOR
Alert_Cockroach_561