Qwen3.6 NVFP4 tests 200k on 5090
OPEN_SOURCE ↗
REDDIT // 4h ago · BENCHMARK RESULT

A community NVFP4 quant of Qwen3.6-27B is shown running on a single RTX 5090 with vLLM, fp8 KV cache, and MTP while staying stable at a validated 200k context. The author’s repeated runs put generation roughly in the 65-75 tok/s range at 200k, with much lower TTFT on warm prefix-cache reuse.

// ANALYSIS

This is a solid proof-of-life for long-context local serving on consumer Blackwell, but it is a tuned benchmark rather than a drop-in default. The real story is that 200k context becomes practical on one 32GB card once you combine aggressive quantization, careful serving knobs, and a willingness to trade away some simplicity.

  • The stack is doing a lot of work here: NVFP4 weights, fp8 KV cache, flashinfer attention, chunked prefill, and MTP are all part of fitting and accelerating the model.
  • The 10-run stability pass matters more than the best single sweep result; the honest 200k generation number is closer to mid-60s to mid-70s tok/s than the headline peak.
  • Prefix caching changes the feel of the system for repeated long prompts, cutting TTFT from roughly a minute to a few seconds on warm reuse.
  • The official Qwen3.6 model already advertises native 262k context, so this post is notable for validation on a single consumer GPU, not for extending the model’s theoretical limit.
  • Accuracy remains an open question: NVFP4 scaling, speculative decoding, and experimental cache behavior all deserve separate evals before anyone treats this as a production baseline.
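The serving stack described above can be approximated with a vLLM launch along these lines. This is a hedged sketch, not the author's exact command: the quant repository name is hypothetical, flag availability varies by vLLM release, and MTP speculative decoding is configured differently across versions, so it is noted rather than spelled out.

```shell
# Assumption: a vLLM build where FlashInfer is a selectable attention backend.
export VLLM_ATTENTION_BACKEND=FLASHINFER

# Serve a community NVFP4 quant (hypothetical repo name) at 200k context.
# NVFP4 checkpoints are typically auto-detected from the model config;
# --kv-cache-dtype fp8 shrinks the KV cache to help 200k fit in 32GB.
vllm serve someuser/Qwen3.6-27B-NVFP4 \
  --max-model-len 200000 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95

# MTP (multi-token prediction) would be enabled through vLLM's
# speculative-decoding configuration; the exact option name depends
# on the release, so it is omitted here rather than guessed.
```

With prefix caching enabled, repeated long prompts reuse cached KV blocks on warm requests, which is what drives the minute-to-seconds TTFT drop the post reports.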
// TAGS
llm · quantization · long-context · inference · gpu · benchmark · open-weights · qwen3-6-27b-nvfp4

DISCOVERED

4h ago

2026-05-06

PUBLISHED

6h ago

2026-05-06

RELEVANCE

8 / 10

AUTHOR

Maheidem