OPEN_SOURCE
REDDIT · 4h ago · BENCHMARK RESULT
Qwen3.6-27B INT4 tops 100 tps
A community AutoRound INT4 quant of Qwen3.6-27B hits 105-108 tokens per second on a single RTX 5090 while keeping the model's native 256k context window in vLLM 0.19. The recipe combines AutoRound INT4 weights, the FlashInfer attention backend, an fp8 KV cache, and MTP speculative decoding to squeeze both throughput and long-context capacity out of one consumer GPU.
// ANALYSIS
This is a strong proof point for “smaller, better-quantized” beating brute-force hardware scaling for local inference. The more interesting story isn’t just the 100 tps number, it’s that the setup preserves the full 256k context without obvious compromise.
- vLLM 0.19 plus FlashInfer and chunked prefill look like the practical serving stack here, not just a lab benchmark
- AutoRound INT4 appears to be the enabler: small enough to fit, fast enough to matter, and reportedly with competitive KLD versus NVFP4
- MTP speculative decoding likely does much of the heavy lifting for the throughput jump, so this is a system result, not just a model result
- The post is especially relevant for single-GPU local deployments, where long context usually forces a tradeoff against speed or batch size
- This is a benchmark/result post, but it also doubles as a deployment recipe for high-throughput local serving; a hedged configuration sketch follows below
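// EXAMPLE
A minimal sketch of the reported recipe using vLLM's offline Python API. The HF model id, the auto-detection of the AutoRound INT4 checkpoint, and the MTP speculative-decoding config are assumptions based on the post, not settings verified against vLLM 0.19; the remaining arguments follow vLLM's documented engine parameters.

    import os

    # FlashInfer attention backend (assumption: same env var as current vLLM releases)
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

    from vllm import LLM, SamplingParams

    llm = LLM(
        # Hypothetical model id; the post uses a community AutoRound INT4 quant.
        # vLLM normally reads the quantization method from the checkpoint's
        # quantization_config, so no explicit flag is set here.
        model="Qwen/Qwen3.6-27B",
        kv_cache_dtype="fp8",            # fp8 KV cache frees VRAM for the long window
        max_model_len=262144,            # keep the native 256k context
        enable_chunked_prefill=True,     # chunked prefill for long prompts
        gpu_memory_utilization=0.95,     # single RTX 5090
        # speculative_config={"method": "mtp", "num_speculative_tokens": 2},
        #   assumption: MTP speculative decoding; exact keys vary by vLLM version
    )

    out = llm.generate(
        ["Summarize the tradeoffs of INT4 quantization for long-context serving."],
        SamplingParams(max_tokens=128),
    )
    print(out[0].outputs[0].text)

For server deployments, the same arguments map onto the `vllm serve` CLI; the fp8 KV cache and chunked prefill are what let the 256k window coexist with useful batch sizes on 32 GB of VRAM.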
// TAGS
qwen3.6-27b-int4-autoround · llm · inference · gpu · benchmark · open-source
DISCOVERED
4h ago
2026-04-26
PUBLISHED
8h ago
2026-04-26
RELEVANCE
9/10
AUTHOR
Kindly-Cantaloupe978