OPEN_SOURCE
REDDIT · BENCHMARK RESULT · 4h ago
Qwen3.6 Benchmarks Favor Plain Dual-GPU Runs
On a 2x RTX 5060 Ti 16GB setup, Qwen3.6-27B and Qwen3.6-35B-A3B are both usable, but the best results come from straightforward tensor-parallel serving rather than speculative decoding. The dense 27B model is the harder fit, while the 35B-A3B MoE looks much more at home on consumer dual-GPU rigs.
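As a reference point, a minimal sketch of that plain TP2 setup via vLLM's Python API. The Hugging Face repo id, context length, and memory fraction are assumptions, not values from the post; a quantized checkpoint is expected to carry its own quantization config that vLLM detects on load.

```python
# Minimal TP2 serving sketch for a dual-GPU box.
# Repo id and settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-35B-A3B",   # assumed HF repo id for the MoE model
    tensor_parallel_size=2,          # one shard per RTX 5060 Ti
    gpu_memory_utilization=0.90,     # leave headroom on each 16 GB card
    max_model_len=8192,              # modest context to keep the KV cache small
)

out = llm.generate(
    ["Summarize tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```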
// ANALYSIS
The hot take is that this is less a "find the magic quant" problem and more a "respect the PCIe ceiling" problem. Once inter-GPU traffic becomes part of the decode path, speculative tricks can erase their own gains.
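To see why, a back-of-envelope sketch: with tensor parallelism, every decoded token pays for per-layer all-reduces over the PCIe link, so decode tends to be latency-bound on the interconnect. Every number below is an assumed round figure, not a spec from the post.

```python
# Rough ceiling on TP2 decode speed when per-layer all-reduces ride PCIe.
# All values here are illustrative assumptions, not measured results.
hidden_size = 5120           # assumed model width
num_layers = 48              # assumed transformer layer count
bytes_per_elem = 2           # bf16 activations
allreduces_per_layer = 2     # attention output + MLP output

syncs = num_layers * allreduces_per_layer          # link round-trips per token
per_token_bytes = syncs * hidden_size * bytes_per_elem

pcie_bw = 25e9               # ~PCIe 4.0 x16 effective bytes/s, assumed
sync_latency = 30e-6         # assumed per-all-reduce latency over PCIe

t = per_token_bytes / pcie_bw + syncs * sync_latency  # seconds per token
print(f"traffic/token: {per_token_bytes / 1e6:.2f} MB")
print(f"link-only ceiling: {1 / t:,.0f} tok/s")
```

With these assumed numbers, the fixed per-sync latency dwarfs the raw bandwidth cost, which is why schemes that add extra synchronization points, like draft-then-verify speculative decoding, can cancel out their own savings.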
- Qwen3.6-35B-A3B is the clearer win on this hardware: vLLM NVFP4 with TP2 gives the best balance of prompt throughput and token generation.
- Qwen3.6-27B dense is more sensitive to backend and quant choice; vLLM beats llama.cpp on raw prompt speed, but token speed and TTFT trade off quickly.
- The llama.cpp layer-split setups (see the sketch after this list) are interesting mainly because they keep performance more balanced, not because they dominate every metric.
- The failed speculative decoding runs are a useful signal: on 2x 16GB cards, the bottleneck is probably data movement, not model compute.
- If the goal is practical local use, tuning backend placement and KV-cache strategy, as sketched below, matters more than chasing speculative decoding on this class of dual-GPU setup.
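For the llama.cpp side, a minimal sketch of the layer-split and KV-cache knobs the bullets refer to, via llama-cpp-python. The GGUF filename and the even split ratio are placeholders, and the Q8_0 cache types are one plausible choice; quantizing the V cache requires flash attention.

```python
# Layer-split across two GPUs plus an 8-bit KV cache, via llama-cpp-python.
# Model path and split ratio are placeholders; tune both for your rig.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q4_k_m.gguf",         # hypothetical local GGUF
    n_gpu_layers=-1,                               # offload all layers
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,   # split by layer, not row
    tensor_split=[0.5, 0.5],                       # even split across 2 cards
    flash_attn=True,                               # needed for quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,               # 8-bit K cache
    type_v=llama_cpp.GGML_TYPE_Q8_0,               # 8-bit V cache
    n_ctx=8192,
)

print(llm("Hello from a dual-GPU box:", max_tokens=32)["choices"][0]["text"])
```

Layer split avoids per-token all-reduces entirely: each GPU owns whole layers and activations cross the link once per boundary, which is consistent with the more balanced numbers the post reports.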
// TAGS
qwen3.6 · qwen3.6-27b · qwen3.6-35b-a3b · benchmark · inference · gpu · vllm · llamacpp · pcie
DISCOVERED
2026-04-27
PUBLISHED
2026-04-27
RELEVANCE
9/10
AUTHOR
ziphnor