Qwen3.6 NVFP4 MTP clears 60 tok/s
OPEN_SOURCE ↗
REDDIT · 3h ago · BENCHMARK RESULT


A community NVFP4 text-only Qwen3.6-27B variant with the MTP head restored is hitting roughly 60 tok/s on dual RTX 5060 Ti 16GB cards while still accepting a 204k-token context. It works, but only as a tightly tuned single-request setup with very little VRAM headroom.

// ANALYSIS

This is a strong proof point for local inference on 2x16GB Blackwell-class cards, but it is not a “drop it in and scale it” recipe. The real story is that the right quantization plus speculative decoding makes a 27B model surprisingly usable on consumer hardware.
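The setup described above maps onto a fairly standard vLLM launch. This is a sketch under assumptions, not the poster's exact command: the Hugging Face repo id is hypothetical, 204800 is an approximation of the "204k" context, and the speculative-decoding flag syntax varies by vLLM version, so the MTP line is indicative only.

```shell
# Hypothetical repo id; the post does not name the exact checkpoint.
# --gpu-memory-utilization 0.95 and --max-num-seqs 1 mirror the values
# the benchmark reports as necessary for this config to fit at all.
vllm serve someuser/Qwen3.6-27B-text-NVFP4-MTP \
  --tensor-parallel-size 2 \
  --max-model-len 204800 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1
  # plus a speculative-decoding option enabling the restored MTP head;
  # the exact flag (e.g. a --speculative-config JSON) is version-dependent
```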

  • Restoring the MTP head matters: speculative decoding lifts throughput from the low-50s tok/s to the low-60s tok/s range at 8K context.
  • The long-context win is real, but fragile: 204k fits only near the edge, and a 168k prefill already pushes per-GPU VRAM into the mid-15 GiB range.
  • `gpu_memory_utilization=0.95` versus `0.94` is the difference between success and failed KV allocation, which tells you how little slack this config has.
  • `max_num_seqs=1` keeps the benchmark honest: this is a single-stream latency setup, not a multi-user serving profile.
  • The startup OOM fallbacks and multi-minute compile/autotune phase are the tax for making the runtime path work on hardware this small.
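Back-of-the-envelope KV-cache math shows why a 168k prefill leaves so little headroom. All architecture numbers below are illustrative assumptions (the real Qwen3.6-27B layer/head config is not given in the post), but the shape of the calculation is what matters:

```python
# Rough KV-cache sizing for a long prefill. Assumed (NOT confirmed) config:
layers = 48      # assumed transformer layers
kv_heads = 8     # assumed GQA key/value heads
head_dim = 128   # assumed head dimension
elem_bytes = 1   # assumed FP8 KV cache, 1 byte per element

# 2 tensors (K and V) per layer, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * elem_bytes

tokens = 168_000
total_gib = tokens * bytes_per_token / 2**30
per_gpu_gib = total_gib / 2  # split across the two cards under tensor parallelism

print(f"{bytes_per_token} B/token -> {total_gib:.1f} GiB KV cache "
      f"({per_gpu_gib:.1f} GiB per GPU) at {tokens} tokens")
```

Even under these generous assumptions, the KV cache alone claims several GiB per 16 GiB card on top of the quantized weights, which is consistent with the report that a single percentage point of `gpu_memory_utilization` decides whether allocation succeeds.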
// TAGS
llm · inference · gpu · benchmark · open-weights · qwen3.6-27b-text-nvfp4-mtp

DISCOVERED

3h ago

2026-04-29

PUBLISHED

6h ago

2026-04-29

RELEVANCE

8/10

AUTHOR

do_u_think_im_spooky