Dual RTX 3060s power $400 Qwen3.6-27B rig
This post benchmarks an ultra-budget dual-RTX 3060 setup running Unsloth’s Qwen3.6-27B GGUF variants in llama.cpp on CUDA. The author reports strong, stable throughput on a dated PCIe 3.0 x8/x8 platform, with MTP pushing generation into the low-40 t/s range and non-MTP mode delivering more context at a still-solid ~30 t/s. The main tradeoff is that tensor parallel mode currently blocks KV-cache quantization, which caps usable context and makes very long prompts awkward.
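The link between KV-cache quantization and usable context comes down to simple arithmetic, sketched below. The model dimensions are placeholders for illustration, not the real Qwen3.6-27B configuration, and the formula assumes a plain full-attention transformer; the real values would come from the GGUF metadata. The bytes-per-element figures follow llama.cpp's block layouts for `q8_0` (32 values in 34 bytes) and `q4_0` (32 values in 18 bytes).

```python
# Rough KV-cache sizing sketch.
# ASSUMPTION: the dimensions used in __main__ are hypothetical placeholders,
# not the actual Qwen3.6-27B architecture.

# Bytes per cached element for common llama.cpp cache types.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Return the combined K and V cache size in GiB for n_ctx tokens."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


if __name__ == "__main__":
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128)  # hypothetical
    for n_ctx in (12_000, 160_000):
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_gib(n_ctx, cache_type=cache_type, **dims)
            print(f"ctx={n_ctx:>7} {cache_type:>5}: {size:6.2f} GiB")
```

With these placeholder dimensions, an unquantized f16 cache at 160k tokens comes to roughly 39 GiB, more than the two cards' combined 24 GB before any weights are loaded, while `q4_0` would need about 11 GiB. That gap is why losing cache quantization in tensor-split mode caps context so sharply.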
Hot take: this is the kind of result that makes a “budget local LLM rig” feel real rather than theoretical. Two cheap 3060s running llama.cpp on CUDA deliver far more than the price suggests, especially on stability.
- Prefill stays healthy even at 12k context, landing around 456 t/s with MTP, with an initial peak above 600 t/s.
- Generation reaches 43.26 t/s with MTP and about 31 t/s without it, a fair trade given the extra context that non-MTP mode allows.
- The old i7-4770K/Z87 platform is not the bottleneck people would assume, because PCIe 3.0 x8/x8 is competitive with the lane splits on many newer consumer boards.
- The biggest downside is architectural: `-sm tensor` cannot currently be combined with KV-cache quantization, so 160k-class contexts are out of reach in this configuration.
- vLLM appears to be the wrong tool for this VRAM-constrained use case; llama.cpp is the practical winner.
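The PCIe point in the bullets above can be checked with a quick theoretical-bandwidth calculation. This is a minimal sketch using the published per-lane transfer rates and 128b/130b encoding for PCIe 3.0 and 4.0. The comparison against a PCIe 4.0 x4 slot is an assumed example of a newer board's secondary slot, not a configuration taken from the post, and real-world throughput will be lower than these ceilings.

```python
# Theoretical one-direction PCIe link bandwidth after encoding overhead.
# ASSUMPTION: the 4.0 x4 comparison slot is an illustrative example only.

GT_PER_S = {3: 8.0, 4: 16.0}  # giga-transfers per second per lane
ENCODING = 128 / 130          # 128b/130b line encoding (Gen 3 and Gen 4)


def pcie_gb_per_s(gen, lanes):
    """Return the theoretical bandwidth in GB/s for a link of `lanes` lanes."""
    bits_per_s = GT_PER_S[gen] * ENCODING * lanes  # usable gigabits per second
    return bits_per_s / 8                          # bits to bytes


if __name__ == "__main__":
    for label, gen, lanes in [("PCIe 3.0 x8", 3, 8), ("PCIe 4.0 x4", 4, 4)]:
        print(f"{label}: {pcie_gb_per_s(gen, lanes):.2f} GB/s")
```

Both links come out at about 7.88 GB/s per direction, which is consistent with the claim that an x8/x8 split on an old Z87 board is not inherently worse than a newer board that drops its second slot to four faster lanes.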
Discovered: 2026-05-27
Published: 2026-05-26
Author: akira3weet