OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Qwen3.6-35B-A3B Hits 33 T/s on 6 GB VRAM
This post benchmarks Qwen3.6-35B-A3B on an ASUS Zephyrus G14 (2020) with an RTX 2060 Max-Q 6 GB, a Ryzen 4900HS, and 24 GB RAM. The author reports improving throughput from about 12 tok/s to 22-33 tok/s by tuning the quantization format, CPU offload behavior, and speculative decoding settings.
// ANALYSIS
Hot take: this reads less like a model benchmark and more like a reminder that local LLM performance is often an engineering problem disguised as a hardware problem.
- The strongest practical finding is that clean IQ4_NL and APEX I-Compact quants beat “faster-looking” dynamic quants once CPU offload enters the picture.
- Ngram speculative decoding appears unusually effective for coding-agent workloads: the author reports very high draft acceptance and a peak of roughly 33 tok/s.
- The automated overnight parameter search found nothing better than the manually tuned config, suggesting the last-mile gains came mostly from targeted human debugging.
- The `--poll 0` result looks suspicious: it may be a measurement artifact or a workload-specific anomaly rather than a general optimization.
- The battery result is the most surprising part: sustained 10 tok/s on a five-year-old laptop is genuinely useful for portable local inference.
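The speculative-decoding bullet invites a quick sanity check. A common back-of-envelope model (not from the post; the acceptance rate `p` and draft length `k` below are illustrative assumptions) puts the expected tokens emitted per target-model verification pass at (1 - p^(k+1)) / (1 - p), which bounds the achievable speedup:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    assuming each of k drafted tokens is accepted independently with
    probability p (+1 for the token the target model always emits).
    Equals (1 - p**(k+1)) / (1 - p) for p < 1."""
    return sum(p**i for i in range(k + 1))

# Illustrative numbers, not the post's measurements:
base_tps = 12.0   # plain-decoding throughput
p, k = 0.85, 4    # hypothetical acceptance rate and draft length
speedup = expected_tokens_per_step(p, k)
print(f"~{speedup:.2f}x upper bound -> ~{base_tps * speedup:.0f} tok/s")
```

Real gains sit below this bound because drafting and verification carry their own overhead, but it shows why a jump from ~12 to ~33 tok/s is plausible once draft acceptance is very high.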
// TAGS
qwen · qwen3 · local-first · llama.cpp · quantization · cpu-offload · speculative-decoding · moe · benchmark · laptop-inference
DISCOVERED
2026-05-03 (4h ago)
PUBLISHED
2026-05-03 (4h ago)
RELEVANCE
8/10
AUTHOR
abhinand05