Qwen3.6-27B benchmarks on dual V100s
The benchmark looks broadly sane: Qwen3.6-27B is running across two V100 32GB cards in llama.cpp tensor-parallel mode with flash attention and an unquantized KV cache. The big story is not a misconfiguration, but the expected throughput drop as prefill depth climbs into long-context territory.
This is a credible dual-V100 setup, and the 64K prompt-processing (pp) slowdown looks like the normal cost of deeper KV-cache prefill rather than a red flag. The main question is less “is it broken?” and more “are V100s the right tradeoff if your workload is mostly text and long context?”
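For reference, here is a minimal sketch of the kind of llama-bench sweep being described. The model filename, quant level, and exact depth list are assumptions for illustration; `-sm tensor`, `--flash-attn 1`, and `-d` are the flags the thread itself mentions, explained in the list below:

```bash
# Hypothetical reproduction of the thread's setup. The model path and
# the depth list are assumptions; the split-mode, flash-attention, and
# depth flags are the ones discussed in the thread.
./llama-bench \
  -m ./Qwen3.6-27B-Q4_K_M.gguf \
  -sm tensor \
  --flash-attn 1 \
  -d 4096,16384,65536 \
  -p 512 -n 128
```

llama-bench accepts comma-separated value lists for most parameters, so a single invocation like this emits one result row per depth and makes the prefill falloff easy to read at a glance.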
- `-sm tensor` plus `--flash-attn 1` is the right llama.cpp path for multi-GPU tensor split; llama.cpp also expects a non-quantized KV cache in this mode.
- `-d` sets context depth for the test, so each run is intentionally stressing a larger KV cache and more memory traffic.
- Qwen3.6-27B is a fitting stress test here: it is a 27B dense model with a native 262K context window and a strong coding-agent bias.
- The value of 2x V100 is VRAM headroom and context comfort, not raw speed; if latency is the priority, a 3090-class card will usually be faster.
- The thread’s note about `64` CPU threads is worth revisiting, because that is probably more CPU parallelism than a single GPU-offloaded request can use; a quick thread sweep (see the sketch after this list) would settle it.
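Since thread count is cheap to test, a hedged sweep along these lines would show where the returns flatten (the thread values here are illustrative, not from the thread):

```bash
# Compare generation throughput at several CPU thread counts.
# With the model fully offloaded to the two V100s, throughput usually
# plateaus well below 64 threads; the sweep makes the knee visible.
./llama-bench \
  -m ./Qwen3.6-27B-Q4_K_M.gguf \
  -sm tensor --flash-attn 1 \
  -t 8,16,32,64 \
  -n 128
```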
DISCOVERED: 2026-05-10
PUBLISHED: 2026-05-10
AUTHOR: starkruzr