Qwen3.6 vision stalls in llama.cpp
OPEN_SOURCE
REDDIT · 5h ago · INFRASTRUCTURE


A LocalLLaMA user reports that the Qwen3.6-35B-A3B GGUF takes about 95 seconds to begin responding after a 1080p image upload in llama.cpp on an RTX 4080, while Gemma4 starts in roughly 10 seconds under similar conditions. Commenters suspect the vision projector is being kept in system RAM, the effect of running with --no-mmproj-offload, but another user says Qwen3.6 image processing also feels slower than Qwen3.5 even with the projector loaded into VRAM.
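If projector placement is the culprit, the first thing to check is the launch command. A minimal sketch, assuming a recent llama.cpp build; the GGUF filenames below are placeholders, not the thread's actual files:

```shell
# By default llama.cpp offloads the multimodal projector (mmproj) to the GPU,
# so make sure --no-mmproj-offload is NOT present in the launch command.
llama-server \
  -m qwen3.6-35b-a3b.gguf \
  --mmproj qwen3.6-mmproj.gguf \
  -ngl 99   # offload all model layers to VRAM

# For comparison, this variant forces the projector onto the CPU and is a
# common cause of slow image prefill:
#   llama-server -m qwen3.6-35b-a3b.gguf --mmproj qwen3.6-mmproj.gguf \
#     -ngl 99 --no-mmproj-offload
```

Whether this explains the full 95-second stall is exactly what the thread leaves open, since one commenter reports slowness even with the projector in VRAM.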

// ANALYSIS

This is not a launch story, but it is the kind of deployment friction that decides whether a strong open multimodal model actually gets used locally.

  • Qwen’s official materials position Qwen3.6-35B-A3B as a 35B-total, 3B-active open MoE model with strong multimodal and coding performance, so slow image prefill in common GGUF workflows matters.
  • The user’s text-token speed looks healthy at around 65 t/s, suggesting the bottleneck is likely vision preprocessing, projector placement, llama.cpp integration, or model-specific multimodal overhead rather than normal decoding.
  • Gemma4’s much faster image startup on the same machine is the important comparison: local multimodal users care about time-to-first-token after image upload, not just tokens per second after generation starts.
  • The thread is still anecdotal and small, so this is a signal to watch rather than proof of a broad Qwen3.6 regression.
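The time-to-first-token metric the thread is implicitly arguing about can be measured with a small helper wrapped around any streaming response. A sketch, assuming tokens arrive via a Python iterator; the streaming client itself (e.g. against a local llama.cpp server) is out of scope:

```python
import time
from typing import Iterable, List, Tuple

def measure_ttft(token_stream: Iterable[str]) -> Tuple[float, List[str]]:
    """Return (seconds until the first token arrived, all tokens collected).

    Works with any iterator that yields tokens lazily, e.g. a streaming
    chat-completions response from a local server.
    """
    start = time.perf_counter()
    ttft = None
    tokens: List[str] = []
    for tok in token_stream:
        if ttft is None:
            # First token marks the end of prefill (image + prompt processing).
            ttft = time.perf_counter() - start
        tokens.append(tok)
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, tokens

# Usage with a fake stream that stalls 0.2 s before the first token,
# standing in for slow image prefill:
def fake_stream():
    time.sleep(0.2)
    yield from ["Hello", ",", " world"]

ttft, toks = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.2f}s over {len(toks)} tokens")
```

Comparing this number across models on the same machine, as the original poster did with Gemma4, isolates prefill cost from decoding speed.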
// TAGS
qwen3.6-35b-a3b · qwen · llama.cpp · multimodal · inference · gpu · self-hosted · open-weights

DISCOVERED

5h ago

2026-04-22

PUBLISHED

7h ago

2026-04-21

RELEVANCE

6/10

AUTHOR

gilliancarps