OPEN_SOURCE
REDDIT · BENCHMARK RESULT · 4h ago
Qwen3.6 Benchmarks Favor Plain Dual-GPU Runs
On a 2x RTX 5060 Ti 16GB setup, Qwen3.6-27B and Qwen3.6-35B-A3B are both usable, but the best results come from straightforward tensor-parallel serving rather than speculative decoding. The dense 27B model is the harder fit, while the 35B-A3B MoE looks much more at home on consumer dual-GPU rigs.
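As a reference point, a minimal sketch of that plain TP2 setup via vLLM's Python API. The Hugging Face repo id, context length, and memory fraction are assumptions, not values from the post; a quantized checkpoint is expected to carry its own quantization config that vLLM detects on load.

```python
# Minimal TP2 serving sketch for a dual-GPU box.
# Repo id and settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-35B-A3B",   # assumed HF repo id for the MoE model
    tensor_parallel_size=2,          # one shard per RTX 5060 Ti
    gpu_memory_utilization=0.90,     # leave headroom on each 16 GB card
    max_model_len=8192,              # modest context to keep the KV cache small
)

out = llm.generate(
    ["Summarize tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```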
// ANALYSIS
The hot take is that this is less a "find the magic quant" problem and more a "respect the PCIe ceiling" problem. Once inter-GPU traffic becomes part of the decode path, speculative tricks can erase their own gains.
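To see why, a back-of-envelope sketch: with tensor parallelism, every decoded token pays for per-layer all-reduces over the PCIe link, so decode tends to be latency-bound on the interconnect. Every number below is an assumed round figure, not a spec from the post.

```python
# Rough ceiling on TP2 decode speed when per-layer all-reduces ride PCIe.
# All values here are illustrative assumptions, not measured results.
hidden_size = 5120           # assumed model width
num_layers = 48              # assumed transformer layer count
bytes_per_elem = 2           # bf16 activations
allreduces_per_layer = 2     # attention output + MLP output

syncs = num_layers * allreduces_per_layer          # link round-trips per token
per_token_bytes = syncs * hidden_size * bytes_per_elem

pcie_bw = 25e9               # ~PCIe 4.0 x16 effective bytes/s, assumed
sync_latency = 30e-6         # assumed per-all-reduce latency over PCIe

t = per_token_bytes / pcie_bw + syncs * sync_latency  # seconds per token
print(f"traffic/token: {per_token_bytes / 1e6:.2f} MB")
print(f"link-only ceiling: {1 / t:,.0f} tok/s")
```

With these assumed numbers, the fixed per-sync latency dwarfs the raw bandwidth cost, which is why schemes that add extra synchronization points, like draft-then-verify speculative decoding, can cancel out their own savings.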
- Qwen3.6-35B-A3B is the clearer win on this hardware: vLLM NVFP4 with TP2 gives the best balance of prompt throughput and token generation.
- Qwen3.6-27B dense is more sensitive to backend and quant choice; vLLM beats llama.cpp on raw prompt speed, but token speed and TTFT trade off quickly.
- The llama.cpp layer-split setups (see the sketch after this list) are interesting mainly because they keep performance more balanced, not because they dominate every metric.
- The failed speculative decoding runs are a useful signal: on 2x 16GB cards, the bottleneck is probably data movement, not model compute.
- If the goal is practical local use, tuning backend placement and KV-cache strategy, as sketched below, matters more than chasing speculative decoding on this class of dual-GPU setup.
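For the llama.cpp side, a minimal sketch of the layer-split and KV-cache knobs the bullets refer to, via llama-cpp-python. The GGUF filename and the even split ratio are placeholders, and the Q8_0 cache types are one plausible choice; quantizing the V cache requires flash attention.

```python
# Layer-split across two GPUs plus an 8-bit KV cache, via llama-cpp-python.
# Model path and split ratio are placeholders; tune both for your rig.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q4_k_m.gguf",         # hypothetical local GGUF
    n_gpu_layers=-1,                               # offload all layers
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,   # split by layer, not row
    tensor_split=[0.5, 0.5],                       # even split across 2 cards
    flash_attn=True,                               # needed for quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,               # 8-bit K cache
    type_v=llama_cpp.GGML_TYPE_Q8_0,               # 8-bit V cache
    n_ctx=8192,
)

print(llm("Hello from a dual-GPU box:", max_tokens=32)["choices"][0]["text"])
```

Layer split avoids per-token all-reduces entirely: each GPU owns whole layers and activations cross the link once per boundary, which is consistent with the more balanced numbers the post reports.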
// TAGS
qwen3.6 · qwen3.6-27b · qwen3.6-35b-a3b · benchmark · inference · gpu · vllm · llamacpp · pcie
DISCOVERED
2026-04-27
PUBLISHED
2026-04-27
RELEVANCE
9/10
AUTHOR
ziphnor