Qwen3.6-27B hits 85 TPS locally
Wasif Basharat’s write-up shows how to run Qwen3.6-27B with vision, tool calling, prefix cache, and a 125K context window on a single RTX 3090 using an AutoRound INT4 quant, vLLM, and a stack of runtime patches. The bigger story is not raw speed alone: a frontier-grade open model now looks practical on used consumer hardware, provided you are willing to work close to the metal.
This is the kind of post that moves local inference from hobbyist flex to reproducible playbook: the model was already strong, but the engineering stack is what makes it usable.
- The headline numbers are substantial: 85 TPS sustained, 106 TPS peak, and a 125K context with vision enabled, all on a 24 GB card. That is a serious density milestone for self-hosted inference.
- The article is a deployment recipe rather than a benchmark screenshot: it documents shard verification, patches for Ampere-specific vLLM/TurboQuant issues, and the exact tradeoffs that made the setup stable.
- It also underlines how fragile bleeding-edge open-model serving still is: the path to “works overnight” currently runs through monkey-patches, model-specific quirks, and a careful refusal to push past safe context limits.
- For AI developers, the practical implication is bigger than this one model: open dense models in the 27B class are getting close enough to cloud-class usefulness that infra craftsmanship matters as much as model quality.
- Community reaction on Reddit was strong precisely because this setup compresses privacy, cost control, and respectable multimodal throughput into hardware many local-LLM users already own.
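The summary mentions shard verification without saying how the write-up does it. As a minimal sketch of one common approach, the Python below checks downloaded weight shards against known SHA-256 digests. The function names, the `expected` manifest, and the shard file names are hypothetical illustrations, not taken from the original recipe.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB shards never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shards(model_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the names of shards that are missing or whose digest differs.

    `expected` maps shard file name -> hex SHA-256 digest. Where that manifest
    comes from (for example, checksums published with the model) is an
    assumption of this sketch.
    """
    bad = []
    for name, want in expected.items():
        shard = model_dir / name
        if not shard.is_file() or sha256_of(shard) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `verify_shards` means every listed shard is present and matches its digest. Any returned name is a candidate for re-download before the server is started, which is cheaper than debugging a corrupted load at runtime.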
DISCOVERED: 2026-04-23
PUBLISHED: 2026-04-23
AUTHOR: AmazingDrivers4u
