Qwen3.6-27B hits 85 TPS local

// 90d ago TUTORIAL

Wasif Basharat’s write-up shows how to run Qwen3.6-27B with vision, tool calling, prefix cache, and a 125K context window on a single RTX 3090 using an AutoRound INT4 quant, vLLM, and a stack of runtime patches. The bigger story is not just raw speed, but that a frontier-grade open model now looks practical on used consumer hardware if you are willing to live close to the metal.
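As a rough illustration of what such a launch might look like, the sketch below assembles a `vllm serve` command with a long context window, prefix caching, and tool calling turned on. It is not the write-up's actual command. The model path is a hypothetical placeholder, the `0.92` memory fraction and the `hermes` tool-call parser are assumptions, and the runtime patches the write-up relies on are not reproduced here.

```python
from dataclasses import dataclass
import shlex


@dataclass
class ServeConfig:
    # Hypothetical placeholder: substitute the INT4 AutoRound checkpoint you actually use.
    model: str = "your-org/qwen3.6-27b-int4-autoround"
    # "125K" from the summary, taken here as 125,000 tokens (assumption).
    max_model_len: int = 125_000
    # Assumed fraction of GPU memory vLLM may claim; tune for your card.
    gpu_memory_utilization: float = 0.92
    # Assumed parser name; check which parser matches the model's chat template.
    tool_call_parser: str = "hermes"


def build_command(cfg: ServeConfig) -> list[str]:
    """Return the argument list for a `vllm serve` launch."""
    return [
        "vllm", "serve", cfg.model,
        "--max-model-len", str(cfg.max_model_len),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
        "--enable-prefix-caching",      # reuse KV cache for shared prompt prefixes
        "--enable-auto-tool-choice",    # let the model emit tool calls
        "--tool-call-parser", cfg.tool_call_parser,
    ]


def as_shell(cfg: ServeConfig) -> str:
    """Render the command as a single shell-safe string."""
    return shlex.join(build_command(cfg))
```

Calling `as_shell(ServeConfig())` yields a string that can be pasted into a terminal. Keeping the settings in one dataclass makes it easier to record exactly which context length and memory fraction a stable run used.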

// ANALYSIS

This is the kind of post that moves local inference from hobbyist flex to reproducible playbook: the model was already strong, but the engineering stack is what makes it usable.

– The headline numbers are substantial: 85 tokens per second (TPS) sustained, 106 TPS peak, and a 125K context with vision enabled, all on a 24 GB card. That is a serious density milestone for self-hosted inference.
– The article is a deployment recipe rather than a benchmark screenshot: it documents shard verification, patches for Ampere-specific vLLM/TurboQuant issues, and the exact tradeoffs that made the setup stable.
– It also underlines how fragile bleeding-edge open model serving still is: the path to “works overnight” currently runs through monkey-patches, model-specific quirks, and a deliberate refusal to push past safe context limits.
– For AI developers, the practical implication is bigger than this one model: open dense models in the 27B class are getting close enough to cloud-class usefulness that infra craftsmanship matters as much as model quality.
– Community reaction on Reddit was strong precisely because this setup compresses privacy, cost control, and respectable multimodal throughput into hardware many local-LLM users already own.
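A back-of-the-envelope calculation helps show why a 27B model on a 24 GB card is tight. The sketch below estimates the size of 4-bit weights and uses the standard per-token KV-cache formula for attention layers. The architecture numbers passed to the KV-cache function are hypothetical placeholders, not Qwen3.6-27B's real configuration, and the estimate ignores quantization scales, activations, and the vision encoder.

```python
GB = 1_000_000_000  # decimal gigabytes, for simplicity


def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage, ignoring quantization scales and zero-points."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes_per_token(n_kv_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    """Keys plus values for one token across every layer that keeps a KV cache."""
    return 2 * n_kv_layers * n_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes: float, per_token_bytes: int) -> int:
    """How many cached tokens fit in the memory left after the weights."""
    return int(budget_bytes // per_token_bytes)


# 27B parameters at 4 bits each.
weights = weight_bytes(27_000_000_000, 4)
# Memory left on a 24 GB card before activations and runtime overhead.
remaining = 24 * GB - weights

# HYPOTHETICAL architecture values, chosen only to exercise the formula.
per_token = kv_cache_bytes_per_token(n_kv_layers=16, n_kv_heads=4,
                                     head_dim=128, bytes_per_elem=2)
```

Under these assumptions the weights alone take about 13.5 GB, leaving roughly 10.5 GB for everything else. The KV cache grows with every token of context, which fits the summary's point about careful context limits.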

// TAGS

qwen3-6-27b, llm, multimodal, inference, gpu, self-hosted, open-weights

DISCOVERED

90d ago

2026-04-23

PUBLISHED

90d ago

2026-04-23

RELEVANCE

8/10

AUTHOR

AmazingDrivers4u
