OPEN_SOURCE
REDDIT // 34d ago · INFRASTRUCTURE
Qwen3.5 prefill lags Qwen3 Coder in llama.cpp
A Reddit troubleshooting thread on LocalLLaMA found that Qwen3.5’s much slower prompt evaluation in llama.cpp is mostly an architecture and optimization story, not a simple regression. A llama.cpp maintainer said Qwen3.5 and Qwen3-Next use a newer design that trades slower prompt processing for steadier token generation, while commenters also pointed to still-maturing runtime optimizations and suboptimal VRAM fitting on a 16GB card.
// ANALYSIS
This is a good reminder that “same size, same quant, same server” does not mean comparable local inference behavior once architectures diverge.
- Qwen3.5’s official docs describe a new hybrid stack built from Gated Delta Networks plus sparse MoE, while Qwen3-Coder belongs to the older Qwen3 generation that local runtimes have had longer to tune
- The Reddit comparison is not apples to apples: Qwen3.5 was run with `--n-cpu-moe 1`, Qwen3-Coder with `--n-cpu-moe 33`, and both were given a very large 200K context window on a 16GB GPU
- A llama.cpp maintainer recommended switching from manual MoE placement to `--fit on`, and another commenter suggested tuning `-b`, `-ub`, `-fa on`, and `--fit-ctx` to improve prefill speed
- The thread matters because local model UX is increasingly gated by prompt ingestion and context handling, not just tokens/sec once generation starts
- For AI coding agents, slower prefill can still be an acceptable trade if Qwen3.5 delivers more stable long-context behavior and stronger real-world tool use
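To make the mismatch concrete, here is a sketch of the server invocations discussed in the thread. The flags (`--n-cpu-moe`, `--fit on`, `--fit-ctx`, `-b`, `-ub`, `-fa on`) are the ones named by commenters; the model filenames, context sizes, and batch values are illustrative assumptions, not the thread's exact settings:

```shell
# As benchmarked in the thread (not apples to apples):
# Qwen3.5 kept nearly all MoE expert weights on the 16GB GPU,
# while Qwen3-Coder pushed 33 layers' experts to CPU RAM.
# Model filenames below are hypothetical placeholders.
llama-server -m qwen3.5-q4.gguf       -c 200000 --n-cpu-moe 1
llama-server -m qwen3-coder-q4.gguf   -c 200000 --n-cpu-moe 33

# Direction suggested in the thread: drop manual MoE placement and
# let llama.cpp fit layers and context to VRAM automatically, with a
# smaller context, larger prefill batches, and flash attention on.
# (Values here are illustrative; tune for your card.)
llama-server -m qwen3.5-q4.gguf --fit on --fit-ctx 65536 \
  -b 2048 -ub 2048 -fa on
```

The point of the second invocation is that a 200K context on a 16GB card forces aggressive CPU offload, so shrinking the fitted context often buys back far more prefill speed than any per-flag tweak.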
// TAGS
qwen3-5 · qwen3-coder · llama-cpp · llm · inference · benchmark · open-source
DISCOVERED
2026-03-09
PUBLISHED
2026-03-09
RELEVANCE
6/10
AUTHOR
BitOk4326