OPEN_SOURCE
REDDIT · 3h ago · BENCHMARK RESULT
Qwen3.6 NVFP4 MTP clears 60 tok/s
A community NVFP4 text-only Qwen3.6-27B variant with the MTP head restored is hitting roughly 60 tok/s on dual RTX 5060 Ti 16GB cards while still accepting a 204k-token context. It works, but only as a tightly tuned single-request setup with very little VRAM headroom.
// ANALYSIS
This is a strong proof point for local inference on 2x16GB Blackwell-class cards, but it is not a “drop it in and scale it” recipe. The real story is that the right quantization plus speculative decoding makes a 27B model surprisingly usable on consumer hardware.
- Restoring the MTP head matters: speculative decoding lifts throughput from the low-50s tok/s to the low-60s tok/s range at 8K context.
- The long-context win is real, but fragile: 204k fits only near the edge, and a 168k prefill already pushes per-GPU VRAM into the mid-15 GiB range (the sizing sketch after this list shows why).
- `gpu_memory_utilization=0.95` versus `0.94` is the difference between success and failed KV allocation, which tells you how little slack this config has; both settings appear in the launch sketch below.
- `max_num_seqs=1` keeps the benchmark honest: this is a single-stream latency setup, not a multi-user serving profile.
- The startup OOM fallbacks and multi-minute compile/autotune phase are the tax for making the runtime path work on hardware this small.
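The parameter names quoted above (`gpu_memory_utilization`, `max_num_seqs`) match vLLM's engine arguments, so the sketch below assumes a vLLM launch. The checkpoint id, context length, and speculative-decoding settings are illustrative guesses, not values confirmed by the post.

```python
# Minimal sketch of the kind of launch the post describes, assuming vLLM.
# The model id, max_model_len, and speculative_config are illustrative
# guesses; only gpu_memory_utilization=0.95 and max_num_seqs=1 come from
# the post itself.
from vllm import LLM, SamplingParams

llm = LLM(
    model="someuser/Qwen3.6-27B-text-NVFP4-MTP",  # hypothetical repo id
    tensor_parallel_size=2,       # one shard per RTX 5060 Ti 16GB
    max_model_len=204_000,        # ~204k-token context, as reported
    gpu_memory_utilization=0.95,  # per the post, 0.94 fails KV allocation
    max_num_seqs=1,               # single-stream latency setup
    # NVFP4 weight quantization is normally picked up from the checkpoint
    # config rather than set here. Re-enabling the restored MTP head as a
    # speculative drafter might look like this, but the exact method string
    # varies by vLLM version and model family.
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
)

out = llm.generate(
    ["Summarize speculative decoding in two sentences."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```

Expect the multi-minute compile/autotune phase the post mentions on first startup.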
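A back-of-envelope KV-cache calculation also shows why the 168k prefill lands in the mid-15 GiB range per GPU. Every model dimension below is an assumption for illustration; the post does not state the layer count, KV-head count, head size, or KV-cache dtype.

```python
# Rough KV-cache sizing under assumed Qwen3.6-27B dimensions. None of
# these architecture numbers come from the post; they are placeholders
# chosen to show the order of magnitude.
GIB = 1024**3

layers   = 48       # assumed decoder layers
kv_heads = 8        # assumed GQA key/value heads
head_dim = 128      # assumed per-head dimension
kv_bytes = 1        # assumed FP8 KV cache, 1 byte per element
tokens   = 168_000  # the prefill length from the post
gpus     = 2        # tensor parallelism splits KV cache and weights

per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # 2x for K and V
kv_total = per_token * tokens
weights = 27e9 * 0.5  # ~0.5 bytes/param for NVFP4, ignoring overhead

print(f"KV cache: {kv_total / GIB:.1f} GiB total, "
      f"{kv_total / gpus / GIB:.1f} GiB per GPU")
print(f"Weights:  {weights / gpus / GIB:.1f} GiB per GPU")
# ~7.7 GiB of KV plus ~6.3 GiB of weights per GPU, before activations and
# runtime overhead, is already knocking on a 16 GiB card's ceiling.
```

Under these assumptions the arithmetic agrees with the post's mid-15 GiB observation, which is why 0.95 versus 0.94 memory utilization decides whether the KV allocation succeeds at all.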
// TAGS
llm · inference · gpu · benchmark · open-weights · qwen3.6-27b-text-nvfp4-mtp
DISCOVERED
2026-04-29 (3h ago)
PUBLISHED
2026-04-29 (6h ago)
RELEVANCE
8/10
AUTHOR
do_u_think_im_spooky