OPEN_SOURCE
REDDIT // 14d ago // INFRASTRUCTURE
Qwen3.5-27B Slows Despite Dense Design
A LocalLLaMA user is seeing only ~30 tok/s from Qwen3.5-27B on OpenRouter, even via the fastest listed provider. For a dense 27B model, that is less a surprise than a serving problem: VRAM fit, quantization, batching, and prompt prefill all matter, and the posted TTFT spikes suggest queueing is hurting more than raw decode speed.
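A quick back-of-envelope check shows why queueing, rather than prompt processing, is the likely driver of those TTFT spikes. The prompt length and prefill rate below are illustrative assumptions, not figures from the post:

```python
# Back-of-envelope TTFT decomposition. Numbers are illustrative
# assumptions, not measurements from the OpenRouter post.

def ttft_seconds(queue_wait_s: float, prompt_tokens: int,
                 prefill_tok_per_s: float) -> float:
    """Time to first token = time spent queued + time to prefill the prompt."""
    return queue_wait_s + prompt_tokens / prefill_tok_per_s

# Even a long 8k-token prompt at a modest 2,000 tok/s prefill rate
# accounts for only 4 s of TTFT:
print(ttft_seconds(0.0, 8_000, 2_000.0))  # → 4.0
```

If prefill alone explains only a few seconds, a 30-95 second TTFT implies most of the wait is spent queued behind other requests at the provider.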
// ANALYSIS
30 tok/s is not the real headline here; the 30-95 second TTFT is.
- Qwen3.5-27B is dense, so every generated token uses all 27B parameters. MoE models with far fewer active parameters can be dramatically faster at the same nominal size.
- Qwen’s docs say Qwen3.5 defaults to thinking mode, which can add hidden reasoning work before the visible answer unless the provider disables it.
- OpenRouter’s provider table is routed serving, not a clean single-GPU benchmark, so batching, queue depth, prompt length, and CPU offload can swing throughput a lot.
- TTFT means time to first token, so those long waits include prompt processing and queueing as well as model compute.
- On consumer cards, a 27B model often does not fit comfortably at useful precision, so once layers spill out of VRAM, token speed drops fast.
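The dense-decode point above can be sketched with a memory-bandwidth roofline: each generated token streams every weight through the GPU once, so single-stream decode speed is capped at bandwidth divided by model size. The bandwidth and precision figures here are assumptions for illustration:

```python
# Memory-bandwidth roofline for dense single-stream decode. Each token
# must read all weights once, so decode speed <= bandwidth / weight size.
# Bandwidth and precision figures are illustrative assumptions.

def decode_ceiling_tok_per_s(params_billions: float, bytes_per_param: float,
                             bandwidth_gb_per_s: float) -> float:
    """Upper bound on tok/s when decode is memory-bandwidth bound."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_per_s / weight_gb

# A 27B dense model at 16-bit weights is ~54 GB; on a datacenter card
# with ~3 TB/s of memory bandwidth, the single-stream ceiling is:
print(round(decode_ceiling_tok_per_s(27, 2, 3000), 1))  # → 55.6
```

Against a ~55 tok/s theoretical ceiling at 16-bit precision, observing ~30 tok/s through a batched, routed provider is unremarkable; an MoE model with far fewer active parameters raises that ceiling proportionally.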
// TAGS
qwen3-5-27b · llm · inference · gpu · benchmark · open-weights
DISCOVERED
14d ago
2026-03-29
PUBLISHED
14d ago
2026-03-28
RELEVANCE
8 / 10
AUTHOR
Deep_Row_8729