OPEN_SOURCE
REDDIT // 1d ago · MODEL RELEASE
Qwen3.5-9B hits performance wall in llama.cpp
Users report significantly lower throughput for the newly released Qwen3.5-9B model compared to predecessors, likely due to its hybrid architecture and unoptimized inference settings in popular local engines like llama.cpp.
// ANALYSIS
Qwen3.5's "thinking" capabilities and hybrid Gated DeltaNet/MoE architecture are currently outstripping local optimization efforts, causing a performance "cliff" on consumer hardware.
- Architectural complexity (Gated Delta Networks) requires specific llama.cpp updates that are still maturing, leading to high CPU overhead and low GPU utilization.
- Default "reasoning" modes add significant token overhead, making the model feel slower than dense 8B-9B counterparts despite superior benchmark scores.
- High VRAM usage for the MoE layers often triggers silent system-memory fallbacks on 16GB cards, slashing speeds by up to 70%.
- Early optimization fixes include reducing --ubatch-size to match GPU cache and explicitly disabling the reasoning budget for standard chat tasks.
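The fixes in the last bullet might look something like the following llama.cpp server invocation. This is a hedged sketch: the model filename is hypothetical, and flag availability and exact names (notably `--reasoning-budget`) vary between llama.cpp builds, so verify against `llama-server --help` on your version.

```shell
# Sketch of the reported workarounds, not an official recipe.
# Assumes a recent llama.cpp build; the GGUF filename is a placeholder.
llama-server \
  -m qwen3.5-9b-q4_k_m.gguf \   # hypothetical quantized model file
  --n-gpu-layers 99 \           # keep all layers in VRAM to avoid sysmem fallback
  --ubatch-size 256 \           # smaller micro-batch to fit GPU cache
  --reasoning-budget 0          # disable "thinking" tokens for plain chat
```

Note that on NVIDIA cards the silent system-memory fallback itself is a driver-level policy; the flags above only reduce the chance of hitting it by keeping the working set inside VRAM.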
// TAGS
qwen3.5-9b · llm · ai-coding · reasoning · open-source · inference · gpu
DISCOVERED
2026-04-10
PUBLISHED
2026-04-10
RELEVANCE
9/10
AUTHOR
soyalemujica