OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.6-35B-A3B Hits 33 T/s on 6 GB VRAM
This post benchmarks Qwen3.6-35B-A3B on an ASUS Zephyrus G14 (2020) with an RTX 2060 Max-Q 6 GB, a Ryzen 4900HS, and 24 GB RAM. The author reports improving throughput from about 12 tok/s to 22-33 tok/s by tuning the quantization format, CPU offload behavior, and speculative decoding settings.
// ANALYSIS
Hot take: this reads less like a model benchmark and more like a reminder that local LLM performance is often an engineering problem disguised as a hardware problem.
- The strongest practical finding is that clean IQ4_NL and APEX I-Compact quants beat “faster-looking” dynamic quants once CPU offload enters the picture.
- Ngram speculative decoding appears unusually effective for coding-agent workloads: the author reports very high draft acceptance and a peak of roughly 33 tok/s.
- The automated overnight parameter search found nothing better than the manually tuned config, suggesting the last-mile gains came mostly from targeted human debugging.
- The `--poll 0` result looks suspicious: it may be a measurement artifact or a workload-specific anomaly rather than a general optimization.
- The battery result is the most surprising part: sustained 10 tok/s on a five-year-old laptop is genuinely useful for portable local inference.
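The speculative-decoding bullet invites a quick sanity check. A common back-of-envelope model (not from the post; the acceptance rate `p` and draft length `k` below are illustrative assumptions) puts the expected tokens emitted per target-model verification pass at (1 - p^(k+1)) / (1 - p), which bounds the achievable speedup:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    assuming each of k drafted tokens is accepted independently with
    probability p (+1 for the token the target model always emits).
    Equals (1 - p**(k+1)) / (1 - p) for p < 1."""
    return sum(p**i for i in range(k + 1))

# Illustrative numbers, not the post's measurements:
base_tps = 12.0   # plain-decoding throughput
p, k = 0.85, 4    # hypothetical acceptance rate and draft length
speedup = expected_tokens_per_step(p, k)
print(f"~{speedup:.2f}x upper bound -> ~{base_tps * speedup:.0f} tok/s")
```

Real gains sit below this bound because drafting and verification carry their own overhead, but it shows why a jump from ~12 to ~33 tok/s is plausible once draft acceptance is very high.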
// TAGS
qwen · qwen3 · local-first · llama.cpp · quantization · cpu-offload · speculative-decoding · moe · benchmark · laptop-inference
DISCOVERED
2026-05-03 (4h ago)
PUBLISHED
2026-05-03 (4h ago)
RELEVANCE
8/10
AUTHOR
abhinand05