Qwen3.5-27B local setup guide: llama.cpp vs. vLLM

// 86d ago TUTORIAL

An r/LocalLLaMA community member shares a practical setup guide for running Qwen3.5-27B locally. It compares the llama.cpp and vLLM backends with concrete benchmarks and includes a working vLLM recipe that reaches 50–70 tokens per second (TPS) on RTX 5090 and RTX Pro 6000 hardware.

// ANALYSIS

Community-driven local inference guides like this are often more actionable than official docs: the bug callouts alone (a KV cache wipe in llama.cpp, broken tool call parsing in vLLM v0.17.1) can save hours of debugging.

– llama.cpp is simpler but has an unresolved KV cache invalidation bug that forces full prompt reprocessing, which kills throughput in long sessions
– vLLM is the recommended path but requires a manual patch for Qwen3.5 tool call parsing; the official fix sits in open GitHub PRs that were unmerged as of the post
– The NVFP4+MTP community quant (osoleve/Qwen3.5-27B-Text-NVFP4-MTP) is the key to getting speculative decoding working on vLLM
– 70 TPS at 256k context on an RTX Pro 6000 (96GB) is a strong result for a 27B model running locally
– The author notes that Claude Code CLI handles tool calls better than Opencode after the patch, a useful signal for agentic local inference setups
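
To make the first point concrete, here is a toy model of why a wiped KV cache hurts long sessions. It is only a sketch: the function name, the token IDs, and the session sizes are hypothetical, and it does not reproduce llama.cpp's actual cache logic. It counts how many prompt tokens need prefill when the cached prefix can be reused and when it cannot.

```python
def tokens_to_reprocess(cached: list[int], prompt: list[int], cache_valid: bool = True) -> int:
    """Return how many prompt tokens must go through prefill (toy model)."""
    if not cache_valid:
        # Cache was invalidated: the whole prompt is processed again.
        return len(prompt)
    # Count the leading tokens the new prompt shares with the cached sequence.
    shared = 0
    for cached_tok, prompt_tok in zip(cached, prompt):
        if cached_tok != prompt_tok:
            break
        shared += 1
    return len(prompt) - shared


# Hypothetical session: 8,000 tokens of history plus a 50-token follow-up.
history = list(range(8000))
follow_up = history + list(range(100000, 100050))

print(tokens_to_reprocess(history, follow_up))                     # 50
print(tokens_to_reprocess(history, follow_up, cache_valid=False))  # 8050
```

In this toy case the wiped cache turns a 50-token prefill into an 8,050-token one on every turn, and the gap widens as the session grows.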

// TAGS

qwen, llm, inference, open-weights, self-hosted, devtool

DISCOVERED

86d ago

2026-03-15

PUBLISHED

86d ago

2026-03-15

RELEVANCE

6/10

AUTHOR

kvzrock2020
