OPEN_SOURCE
REDDIT · 27d ago · TUTORIAL
Qwen3.5-27B local setup guide: llama.cpp vs. vLLM
A r/LocalLLaMA community member shares a practical setup guide for running Qwen3.5-27B locally, comparing llama.cpp and vLLM backends with concrete benchmarks and a working vLLM recipe that reaches 50–70 TPS on RTX 5090/Pro 6000 hardware.
// ANALYSIS
Community-driven local inference guides like this are often more actionable than official docs — the bug callouts alone (KV wipe in llama.cpp, broken tool call parsing in vLLM v0.17.1) save hours of debugging.
- llama.cpp is simpler but has an unresolved KV cache invalidation bug that forces full prompt reprocessing, killing throughput in long sessions
- vLLM is the recommended path but requires a manual patch for Qwen3.5 tool call parsing — the official fix is open in GitHub PRs but unmerged as of the post
- The NVFP4+MTP community quant (osoleve/Qwen3.5-27B-Text-NVFP4-MTP) is the key to getting speculative decoding working on vLLM
- 70 TPS at 256k context on an RTX Pro 6000 (96GB) is a strong result for a 27B model run locally
- The author notes Claude Code CLI handles tool calls better than Opencode post-patch — a useful signal for agentic local inference setups
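For orientation, the vLLM side of the recipe above can be sketched as a single `vllm serve` invocation. This is a hedged sketch, not the post's exact command: the flag names follow current vLLM CLI conventions, but the context length, memory fraction, tool-call parser choice, and speculative-decoding settings shown here are illustrative assumptions, and the Qwen3.5 tool-call patch mentioned above would still need to be applied on top.

```shell
# Hedged sketch of serving the community NVFP4+MTP quant with vLLM.
# Assumptions (not from the post): 256k context, 0.92 GPU memory
# utilization, the "hermes" tool-call parser, and an MTP speculative
# config with 1 draft token. Verify each against the vLLM docs for
# your installed version before use.
vllm serve osoleve/Qwen3.5-27B-Text-NVFP4-MTP \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```

Once the server is up, any OpenAI-compatible client (including agentic tools like Claude Code CLI or Opencode) can point at `http://localhost:8000/v1`; throughput in the 50–70 TPS range reported in the post depends on the MTP speculative path actually engaging.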
// TAGS
qwen · llm · inference · open-weights · self-hosted · devtool
DISCOVERED
2026-03-15
PUBLISHED
2026-03-15
RELEVANCE
6/10
AUTHOR
kvzrock2020