OPEN_SOURCE ↗
REDDIT // 4h ago · INFRASTRUCTURE
Qwen3.6 Docker Stack Hits 118 tok/s
This open-source Docker stack packages vLLM serving for Qwen3.6-27B with Lorbus AutoRound INT4 quantization and MTP speculative decoding. The maintainer claims sustained throughput of 118 tokens/sec on dual RTX 3090s, with a simple compose-based setup and reusable model volume.
// ANALYSIS
The pitch is less about a new model and more about turning a high-end local LLM setup into something repeatable, portable, and fast enough to matter. For anyone trying to self-host a serious coding model on consumer GPUs, this is the kind of infra packaging that saves days of yak-shaving.
- The real value is the containerization: it bundles the quant, vLLM flags, model caching, and GPU autodetection into one reproducible path
- 118 tok/s on 2x 3090s is a strong practical benchmark, especially since it reports sustained throughput rather than a cherry-picked peak
- Keeping the model on a host volume means upgrades are cheaper and less annoying, which is exactly what local-LLM operators want
- Vision support and OpenAI-compatible API make it more than a toy benchmark stack
- The main caveat is hardware specificity: the nicest numbers depend on dual 3090-class GPUs and the right vLLM nightlies
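To make the "compose-based setup with a reusable model volume" concrete, here is a minimal sketch of what such a stack might look like. This is not the maintainer's actual file: the image tag, model identifier, quantization flag, and cache path are assumptions layered on top of standard vLLM and Docker Compose conventions.

```yaml
# Hypothetical compose sketch for a dual-GPU vLLM stack like the one described.
# Model name, quantization setting, and paths are illustrative assumptions.
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen3.6-27B
      --quantization auto-round
      --tensor-parallel-size 2
    ports:
      - "8000:8000"
    volumes:
      # Host volume keeps downloaded weights across container upgrades,
      # which is the "reusable model volume" the stack advertises.
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```

Once running, the OpenAI-compatible endpoint would be reachable at `http://localhost:8000/v1`, so any standard OpenAI client can point at it with only a base-URL change.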
// TAGS
qwen36-27b · docker · llm · inference · gpu · self-hosted · open-source
DISCOVERED
4h ago
2026-04-27
PUBLISHED
7h ago
2026-04-27
RELEVANCE
8/10
AUTHOR
tedivm