Qwen3.6 sparks local multimodal RAG push
A LocalLLaMA user is exploring whether Qwen3.6-35B-A3B’s GGUF model plus its separate mmproj vision projector can support mixed image-and-text RAG in llama.cpp. The short answer: mmproj enables image understanding at inference time, but true multimodal retrieval still needs a shared image-text embedding model and vector index.
This is the practical edge of open multimodal models: generation is getting local, but retrieval architecture still matters.
- –The mmproj file is a vision projector for feeding images into Qwen3.6, not a general-purpose embedding model for indexing mixed media
- –A robust setup would use multimodal embeddings such as Qwen3-VL-Embedding or CLIP-style models for retrieval, then pass retrieved text, captions, or images into Qwen3.6 for synthesis
- –llama.cpp support makes local visual question answering realistic, but production RAG still needs chunking, metadata, OCR/caption pipelines, and vector search plumbing
- –The demand signal is clear: developers want open-weight multimodal systems that replace API-only vision RAG stacks without losing control of data
DISCOVERED
45d ago
2026-04-21
PUBLISHED
45d ago
2026-04-21
RELEVANCE
AUTHOR
Then-Analysis947