Qwen3.5 reasoning budget stumps vLLM, SGLang
OPEN_SOURCE
REDDIT · 22d ago · TUTORIAL


A Reddit user asks how to cap Qwen3.5’s reasoning budget when serving through vLLM or SGLang, saying the model keeps chewing through roughly 1,500 thinking tokens no matter what they try. The thread points to a familiar pain point: Qwen’s thinking controls exist, but they’re exposed differently across serving stacks.

// ANALYSIS

Hot take: this looks less like a model bug and more like a docs-and-runtime mismatch. Qwen3.5 has explicit thinking controls, but vLLM and SGLang expose them as parser/template flags rather than a portable “reasoning-budget” primitive, so it’s easy to misconfigure.

  • Qwen’s repo says `enable_thinking=False` is the strict switch, and `/think` / `/no_think` can toggle behavior turn by turn.
  • vLLM documents the `qwen3` reasoning parser and server defaults via `--default-chat-template-kwargs '{"enable_thinking": false}'`, with request-level kwargs overriding the default.
  • SGLang’s Qwen3.5 guide shows `--reasoning-parser qwen3` and `--tool-call-parser qwen3_coder`, but no separate budget knob in the basic launch flow.
  • If the engine still allows free-form thinking, prompt tweaks alone probably won’t stop the model from burning tokens.
  • For production use, this really wants a single compatibility matrix for “thinking on/off,” “budget cap,” and “framework support.”
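Pulling the flags from the bullets above together, a plausible setup looks like the following. This is a sketch, not a verified recipe: the model path `Qwen/Qwen3.5-7B` is a placeholder, and only the flags quoted in the thread (`--reasoning-parser qwen3`, `--default-chat-template-kwargs`, `--tool-call-parser qwen3_coder`) come from the source.

```shell
# vLLM: serve with thinking disabled by default
# (flags per the vLLM bullet above; model path is a placeholder)
vllm serve Qwen/Qwen3.5-7B \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Re-enable thinking for a single request; request-level
# chat_template_kwargs override the server default
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3.5-7B",
        "messages": [{"role": "user", "content": "Explain KV-cache paging."}],
        "chat_template_kwargs": {"enable_thinking": true}
      }'

# SGLang: equivalent launch (flags per the SGLang bullet above);
# note there is no budget-cap flag here, only the on/off parser wiring
python -m sglang.launch_server --model-path Qwen/Qwen3.5-7B \
  --reasoning-parser qwen3 --tool-call-parser qwen3_coder
```

Even with this wiring, the distinction the thread surfaces still holds: these flags toggle thinking on or off, but neither launch line caps the number of thinking tokens.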
// TAGS
qwen-3.5 · vllm · sglang · reasoning · llm · inference

DISCOVERED


2026-03-20

PUBLISHED


2026-03-20

RELEVANCE

8/10

AUTHOR

DingyAtoll