OPEN_SOURCE ↗
REDDIT // 4h ago · INFRASTRUCTURE
Qwen3.6 Docker Stack Hits 118 tok/s
This open-source Docker stack packages vLLM serving for Qwen3.6-27B with Lorbus AutoRound INT4 quantization and MTP speculative decoding. The maintainer claims sustained throughput of 118 tokens/sec on dual RTX 3090s, with a simple compose-based setup and reusable model volume.
// ANALYSIS
The pitch is less about a new model and more about turning a high-end local LLM setup into something repeatable, portable, and fast enough to matter. For anyone trying to self-host a serious coding model on consumer GPUs, this is the kind of infra packaging that saves days of yak-shaving.
- The real value is the containerization: it bundles the quant, vLLM flags, model caching, and GPU autodetection into one reproducible path
- 118 tok/s on 2x 3090s is a strong practical benchmark, especially since it reports sustained throughput rather than a cherry-picked peak
- Keeping the model on a host volume means upgrades are cheaper and less annoying, which is exactly what local-LLM operators want
- Vision support and OpenAI-compatible API make it more than a toy benchmark stack
- The main caveat is hardware specificity: the nicest numbers depend on dual 3090-class GPUs and the right vLLM nightlies
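To make the "compose-based setup with a reusable model volume" concrete, here is a minimal sketch of what such a stack might look like. This is not the maintainer's actual file: the image tag, model identifier, quantization flag, and cache path are assumptions layered on top of standard vLLM and Docker Compose conventions.

```yaml
# Hypothetical compose sketch for a dual-GPU vLLM stack like the one described.
# Model name, quantization setting, and paths are illustrative assumptions.
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen3.6-27B
      --quantization auto-round
      --tensor-parallel-size 2
    ports:
      - "8000:8000"
    volumes:
      # Host volume keeps downloaded weights across container upgrades,
      # which is the "reusable model volume" the stack advertises.
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```

Once running, the OpenAI-compatible endpoint would be reachable at `http://localhost:8000/v1`, so any standard OpenAI client can point at it with only a base-URL change.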
// TAGS
qwen36-27b · docker · llm · inference · gpu · self-hosted · open-source
DISCOVERED
4h ago
2026-04-27
PUBLISHED
7h ago
2026-04-27
RELEVANCE
8/10
AUTHOR
tedivm