Qwen Code Burns 32K Tokens on a Simple `git status`
A LocalLLaMA user describes trying to run Qwen Code on modest hardware with local Qwen3.5 models, then adding RAG and MCP to cut context size. Even a simple `git status` still produces a 32K-token session, exposing how much hidden agent scaffolding these coding tools carry.
This looks less like a broken prompt strategy and more like a classic agent-overhead problem: Qwen Code ships with a heavy hidden system prompt, tool schemas, repo state, and conversation memory, so even a trivial request inherits a large context footprint. The `32,162` input-token figure is also misleadingly scary: `31,806` of those tokens were served from cache, so the fresh payload was tiny.

MCP and RAG only save tokens if they replace broad context; they add cost when they bring big tool manifests, long histories, or oversized retrieved chunks. On local inference, long-context agent workflows can become latency-bound before generation even starts, especially with 27B+ models on consumer hardware.

The biggest gains usually come from prompt-budget discipline: fewer tools loaded at once, shorter system instructions, aggressive state summarization, and on-demand retrieval only. If the goal is a usable local coding agent on 32GB RAM, the host workflow probably needs more pruning before jumping from 9B to 27B or 35B.
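The cache arithmetic is worth spelling out. Using only the two figures quoted from the session (the helper function below is illustrative, not part of Qwen Code):

```python
# Back-of-envelope: how much of the 32K-token session was genuinely new input?
# The totals come from the session described above; the helper is illustrative only.

def fresh_tokens(total_input: int, cached: int) -> tuple[int, float]:
    """Return (tokens processed fresh, cache hit ratio) for one request."""
    fresh = total_input - cached
    return fresh, cached / total_input

fresh, hit_ratio = fresh_tokens(32_162, 31_806)
print(fresh)               # 356 tokens actually processed fresh
print(f"{hit_ratio:.1%}")  # ~98.9% of the input served from cache
```

In other words, the scary 32K number mostly measures static scaffolding that the cache absorbs on every turn; the marginal cost of the `git status` request itself was a few hundred tokens.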
DISCOVERED
5h ago
2026-04-18
PUBLISHED
6h ago
2026-04-18
RELEVANCE
AUTHOR
eur0child