OPEN_SOURCE ↗
REDDIT // 1h ago · BENCHMARK RESULT
2x3090 hits 250W sweet spot for Qwen3.6-27B
Benchmarks for Qwen3.6-27B on dual RTX 3090s identify 250W as the optimal power-to-performance balance for local inference. The testing shows diminishing returns above that limit, with 250W retaining near-peak throughput for long-context workloads.
// ANALYSIS
Local inference enthusiasts facing the 3090 "power wall" should prioritize 250W for efficiency, though 275W offers a measurable latency win for single-stream requests.
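Applying the post's recommended cap is a standard `nvidia-smi` operation; a minimal sketch, assuming two 3090s at GPU indices 0 and 1 (the wattage values come from the post, everything else is stock NVIDIA tooling):

```shell
# Cap both 3090s at the 250W sweet spot (requires root; resets on reboot
# unless persistence mode is on).
sudo nvidia-smi -pm 1           # enable persistence mode
sudo nvidia-smi -i 0 -pl 250    # set power limit on GPU 0, in watts
sudo nvidia-smi -i 1 -pl 250    # set power limit on GPU 1
# For latency-sensitive single-stream use, the post suggests 275W instead.
```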
- Speculative decoding via Multi-Token Prediction (MTP) is the primary performance driver, roughly doubling throughput on coding tasks.
- Tensor parallelism (TP=2) is mandatory to mitigate PCIe bandwidth bottlenecks on dual-GPU consumer setups, outperforming single-card configurations.
- Combining FP8 KV caching with INT4 AutoRound quantization fits a 200K-token context window within the 48GB VRAM budget.
- The results confirm that LLM inference remains memory-bandwidth bound: the 350W stock power limit generates excess heat for negligible token-rate gains.
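The settings above can be sketched as a vLLM launch command. This is a hypothetical reconstruction, not the poster's exact invocation: the flag names follow the current vLLM CLI, the model ID is a placeholder, and exact options may differ by version.

```shell
# Hypothetical vLLM launch reflecting the reported configuration.
#   --tensor-parallel-size 2 : TP=2 across both 3090s
#   --kv-cache-dtype fp8     : FP8 KV cache
#   --max-model-len 200000   : ~200K-token context window
vllm serve Qwen/Qwen3.6-27B \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 200000
# INT4 AutoRound quantization is typically baked into the checkpoint, and
# MTP speculative decoding is configured per model, so neither is shown
# as a separate flag here.
```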
// TAGS
qwen3.6-27b · llm · vllm · gpu · benchmark · self-hosted · inference
DISCOVERED
1h ago
2026-04-28
PUBLISHED
5h ago
2026-04-28
RELEVANCE
8/10
AUTHOR
JC1DA