OPEN_SOURCE
REDDIT // 34d ago · INFRASTRUCTURE
Qwen3.5 prefill lags Qwen3 Coder in llama.cpp
A Reddit troubleshooting thread on LocalLLaMA found that Qwen3.5’s much slower prompt evaluation in llama.cpp is mostly an architecture and optimization story, not a simple regression. A llama.cpp maintainer said Qwen3.5 and Qwen3-Next use a newer design that trades slower prompt processing for steadier token generation, while commenters also pointed to still-maturing runtime optimizations and suboptimal VRAM fitting on a 16GB card.
// ANALYSIS
This is a good reminder that “same size, same quant, same server” does not mean comparable local inference behavior once architectures diverge.
- Qwen3.5’s official docs describe a new hybrid stack built from Gated Delta Networks plus sparse MoE, while Qwen3-Coder belongs to the older Qwen3 generation that local runtimes have had longer to tune
- The Reddit comparison is not apples to apples: Qwen3.5 was run with `--n-cpu-moe 1`, Qwen3-Coder with `--n-cpu-moe 33`, and both were given a very large 200K context window on a 16GB GPU
- A llama.cpp maintainer recommended switching from manual MoE placement to `--fit on`, and another commenter suggested tuning `-b`, `-ub`, `-fa on`, and `--fit-ctx` to improve prefill speed
- The thread matters because local model UX is increasingly gated by prompt ingestion and context handling, not just tokens/sec once generation starts
- For AI coding agents, slower prefill can still be an acceptable trade if Qwen3.5 delivers more stable long-context behavior and stronger real-world tool use
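To make the mismatch concrete, here is a sketch of the server invocations discussed in the thread. The flags (`--n-cpu-moe`, `--fit on`, `--fit-ctx`, `-b`, `-ub`, `-fa on`) are the ones named by commenters; the model filenames, context sizes, and batch values are illustrative assumptions, not the thread's exact settings:

```shell
# As benchmarked in the thread (not apples to apples):
# Qwen3.5 kept nearly all MoE expert weights on the 16GB GPU,
# while Qwen3-Coder pushed 33 layers' experts to CPU RAM.
# Model filenames below are hypothetical placeholders.
llama-server -m qwen3.5-q4.gguf       -c 200000 --n-cpu-moe 1
llama-server -m qwen3-coder-q4.gguf   -c 200000 --n-cpu-moe 33

# Direction suggested in the thread: drop manual MoE placement and
# let llama.cpp fit layers and context to VRAM automatically, with a
# smaller context, larger prefill batches, and flash attention on.
# (Values here are illustrative; tune for your card.)
llama-server -m qwen3.5-q4.gguf --fit on --fit-ctx 65536 \
  -b 2048 -ub 2048 -fa on
```

The point of the second invocation is that a 200K context on a 16GB card forces aggressive CPU offload, so shrinking the fitted context often buys back far more prefill speed than any per-flag tweak.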
// TAGS
qwen3-5 · qwen3-coder · llama-cpp · llm · inference · benchmark · open-source
DISCOVERED
2026-03-09
PUBLISHED
2026-03-09
RELEVANCE
6/10
AUTHOR
BitOk4326