REDDIT · 1d ago · MODEL RELEASE

Qwen3.5-9B hits performance wall in llama.cpp

Users report significantly lower throughput for the newly released Qwen3.5-9B model compared to predecessors, likely due to its hybrid architecture and unoptimized inference settings in popular local engines like llama.cpp.

// ANALYSIS

Qwen3.5's "thinking" capabilities and hybrid Gated DeltaNet/MoE architecture are currently outstripping local optimization efforts, causing a performance "cliff" on consumer hardware.

  • Architectural complexity (Gated DeltaNet) requires specific llama.cpp updates that are still maturing, leading to high CPU overhead and low GPU utilization.
  • Default "reasoning" modes add significant token overhead, making the model feel slower than dense 8B-9B counterparts despite superior benchmark scores.
  • High VRAM usage for the MoE layers often triggers silent system memory fallbacks on 16GB cards, slashing speeds by up to 70%.
  • Early optimization fixes include reducing --ubatch-size to match GPU cache and explicitly disabling the reasoning budget for standard chat tasks.
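The workarounds above can be sketched as a llama.cpp server invocation. This is a minimal illustration, not a confirmed configuration from the thread: the model filename is hypothetical, and the specific values for `--ubatch-size`, `--ctx-size`, and `--n-gpu-layers` are assumptions that depend on the card and quantization used. It assumes a recent llama.cpp build whose `llama-server` exposes a reasoning-budget option.

```shell
# Illustrative tuning for a hybrid MoE model on a 16GB card.
# Filename and numeric values are assumptions, not verified settings.
./llama-server \
  -m qwen3.5-9b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --ubatch-size 128 \
  --reasoning-budget 0
```

The intent: `--n-gpu-layers 99` keeps every layer on the GPU so the MoE weights do not silently spill into system RAM, a smaller `--ubatch-size` trades peak prompt-processing speed for a working set that fits GPU cache, and a reasoning budget of 0 disables the default "thinking" phase for standard chat tasks.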
// TAGS
qwen3.5-9b · llm · ai-coding · reasoning · open-source · inference · gpu

DISCOVERED

2026-04-10

PUBLISHED

2026-04-10

RELEVANCE

9/10

AUTHOR

soyalemujica