OPEN_SOURCE ↗
REDDIT // 4h ago · BENCHMARK RESULT
Qwen3.6 NVFP4 quant validated at 200k context on a single 5090
A community NVFP4 quant of Qwen3.6-27B is shown running on a single RTX 5090 with vLLM, fp8 KV cache, and MTP while staying stable at a validated 200k context. The author’s repeated runs put generation roughly in the 65-75 tok/s range at 200k, with much lower TTFT on warm prefix-cache reuse.
// ANALYSIS
This is a solid proof-of-life for long-context local serving on consumer Blackwell, but it is a tuned benchmark rather than a drop-in default. The real story is that 200k context becomes practical on one 32GB card once you combine aggressive quantization, careful serving knobs, and a willingness to trade away some simplicity.
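The memory pressure driving these trade-offs is easy to estimate. A rough sketch, assuming generic GQA hyperparameters for a ~27B model (48 layers, 8 KV heads, head dim 128 — these are illustrative assumptions, not confirmed Qwen3.6-27B specs), shows why the fp8 KV cache alone nearly decides whether 200k fits on a 32GB card:

```python
# Back-of-envelope KV-cache sizing at 200k context.
# ASSUMED hyperparameters for a ~27B GQA model (not confirmed specs):
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 200_000

def kv_cache_gib(bytes_per_elem: int) -> float:
    # 2 tensors (K and V) per layer per token, KV_HEADS * HEAD_DIM each
    elems_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM
    return elems_per_token * bytes_per_elem * CONTEXT / 2**30

print(f"fp16 KV cache at 200k: {kv_cache_gib(2):.1f} GiB")  # ~36.6 GiB
print(f"fp8  KV cache at 200k: {kv_cache_gib(1):.1f} GiB")  # ~18.3 GiB
```

Under these assumptions, an fp16 KV cache alone would overflow the card before weights are even counted; halving it with fp8 is what leaves room for NVFP4 weights plus activations.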
- The stack is doing a lot of work here: NVFP4 weights, fp8 KV cache, flashinfer attention, chunked prefill, and MTP are all part of fitting and accelerating the model.
- The 10-run stability pass matters more than the best single sweep result; the honest 200k generation number is closer to mid-60s to mid-70s tok/s than the headline peak.
- Prefix caching changes the feel of the system for repeated long prompts, cutting TTFT from roughly a minute to a few seconds on warm reuse.
- The official Qwen3.6 model already advertises native 262k context, so this post is notable for validation on a single consumer GPU, not for extending the model’s theoretical limit.
- Accuracy remains an open question: NVFP4 scaling, speculative decoding, and experimental cache behavior all deserve separate evals before anyone treats this as a production baseline.
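The serving knobs listed above map onto a vLLM launch along these lines. A minimal sketch, assuming a local checkpoint path (hypothetical) and standard vLLM flags; the post's exact values are not all known, and MTP/speculative-decoding options in particular vary by vLLM version:

```shell
# Hedged sketch of a comparable single-GPU long-context launch.
# The model path is hypothetical; tune values for your own setup.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve ./qwen3.6-27b-nvfp4 \
  --max-model-len 200000 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95
# MTP / speculative-decoding flags differ across vLLM releases; consult the
# current vLLM speculative decoding docs before adding them.
```

Note that `--enable-prefix-caching` is what produces the warm-reuse TTFT drop described above; it only pays off when repeated prompts genuinely share a long prefix.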
// TAGS
llm · quantization · long-context · inference · gpu · benchmark · open-weights · qwen3-6-27b-nvfp4
DISCOVERED
4h ago
2026-05-06
PUBLISHED
6h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
Maheidem