OPEN_SOURCE
REDDIT // 6h ago · BENCHMARK RESULT
Qwen3.6-27B hits 218K, stabilizes tool calls
The Qwen3.6-27B local serving stack now reaches roughly 218K context on a single RTX 3090, with about 50-66 TPS depending on workload. After fixing the Genesis PN12 patch application bug, long tool outputs around 25K tokens complete without OOMs, making the setup far more usable for agent-style work than the earlier faster-but-fragile config.
// ANALYSIS
This is less about chasing one more benchmark number and more about crossing the line from "impressive demo" to "actually usable local agent runtime." On a 24GB consumer GPU, stability under long prefill and tool output matters as much as raw decode speed.
- 218K context on a single 3090 is the main headline; the throughput tradeoff is real, but the usable window is much larger than the earlier ~125K setup
- The no-OOM 25K-token tool-output path is the practical win for coding agents, since failures there usually matter more than a few extra tokens/sec
- The PN12 anchor-drift bug shows how brittle local inference stacks can be: patch application correctness can dominate model choice
- The second memory cliff around 50-60K for single-prompt workloads means this is still a tuned edge case, not a universally smooth default
- The result reinforces that consumer-GPU AI is now an optimization problem across model, quantization, KV layout, and serving stack, not just a model benchmark (a rough KV-cache sizing sketch follows this list)
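To make the KV-layout point concrete, here is a minimal back-of-envelope sizing sketch. The layer count, KV-head count, and head dimension below are assumed placeholders for a ~27B GQA model, not confirmed Qwen3.6-27B specs; swap in the real config values to get accurate numbers. The takeaway is only that on a 24 GB card, a six-figure-token context is infeasible without an aggressively quantized KV cache on top of quantized weights.

# Back-of-envelope KV-cache sizing for long-context serving on a 24 GB GPU.
# All architecture numbers are assumed placeholders, NOT confirmed
# Qwen3.6-27B specs; substitute the real values from the model config.

def kv_cache_gib(context_tokens: int,
                 num_layers: int = 60,    # assumed transformer layer count
                 num_kv_heads: int = 8,   # assumed GQA key/value heads
                 head_dim: int = 128,     # assumed per-head dimension
                 bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache size in GiB for a single sequence.

    bytes_per_elem: 2.0 for an FP16/BF16 cache, 1.0 for an 8-bit cache,
    roughly 0.5-0.6 for 4-bit-style quantized KV layouts.
    """
    # Both K and V are cached for every layer, hence the factor of 2.
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_tokens
    return total_bytes / (1024 ** 3)

for ctx in (50_000, 125_000, 218_000):
    print(f"{ctx:>7} tokens: "
          f"~{kv_cache_gib(ctx, bytes_per_elem=2.0):5.1f} GiB FP16 KV, "
          f"~{kv_cache_gib(ctx, bytes_per_elem=1.0):5.1f} GiB 8-bit KV, "
          f"~{kv_cache_gib(ctx, bytes_per_elem=0.56):5.1f} GiB ~4-bit KV")

With these placeholder dimensions, a full-precision cache at 218K tokens would be roughly twice the card's total memory on its own, which is why the viable configs pair a quantized model with a quantized KV cache and careful serving-stack tuning rather than relying on any single knob.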
// TAGS
qwen3.6-27b · llm · gpu · inference · benchmark · agent · open-source
DISCOVERED
6h ago
2026-05-01
PUBLISHED
10h ago
2026-04-30
RELEVANCE
9 / 10
AUTHOR
AmazingDrivers4u