Qwen3.5-27B local setup guide: llama.cpp vs. vLLM
An r/LocalLLaMA community member shares a practical setup guide for running Qwen3.5-27B locally, comparing the llama.cpp and vLLM backends with concrete benchmarks and a working vLLM recipe that reaches 50–70 TPS on RTX 5090/Pro 6000 hardware.
Community-driven local inference guides like this are often more actionable than official docs: the bug callouts alone (KV cache wipe in llama.cpp, broken tool call parsing in vLLM v0.17.1) can save hours of debugging.
- llama.cpp is simpler but has an unresolved KV cache invalidation bug that forces full prompt reprocessing, which kills throughput in long sessions
- vLLM is the recommended path but requires a manual patch for Qwen3.5 tool call parsing; an official fix is open in GitHub PRs but was unmerged as of the post
- The NVFP4+MTP community quant (osoleve/Qwen3.5-27B-Text-NVFP4-MTP) is the key to getting speculative decoding working on vLLM
- 70 TPS at 256k context on an RTX Pro 6000 (96GB) is a strong result for a 27B model run locally
- The author notes that Claude Code CLI handles tool calls better than Opencode after the patch, a useful signal for agentic local inference setups
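To put the throughput figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the 50–70 TPS range and the 256k context come from the post, while the response length and the prefill (prompt processing) rate are hypothetical placeholders, since the summary reports neither.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a steady rate of `tokens_per_second`."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Decode rates reported in the post for the vLLM recipe.
REPORTED_DECODE_TPS = (50.0, 70.0)

# Hypothetical values, NOT from the post; substitute your own measurements.
RESPONSE_TOKENS = 2_000        # assumed length of one generated reply
PROMPT_TOKENS = 256_000        # a full 256k context, as in the post's headline figure
ASSUMED_PREFILL_TPS = 2_000.0  # assumed prompt-processing rate

for tps in REPORTED_DECODE_TPS:
    print(f"decode at {tps:.0f} TPS: {seconds_for(RESPONSE_TOKENS, tps):.1f} s per reply")

# If the KV cache is invalidated, the whole prompt is processed again
# before the next reply can start.
reprocess = seconds_for(PROMPT_TOKENS, ASSUMED_PREFILL_TPS)
print(f"full prompt reprocessing: {reprocess:.0f} s per invalidation")
```

Under these assumed numbers, a single forced reprocessing of a full context costs more time than generating the reply itself, which is why the llama.cpp cache bug matters most in long sessions.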
DISCOVERED: 2026-03-15
PUBLISHED: 2026-03-15
AUTHOR: kvzrock2020