Qwen3-VL hits Vulkan inference friction
OPEN_SOURCE
REDDIT // 7h ago · INFRASTRUCTURE


A LocalLLaMA user reports empty image descriptions when running Qwen3-VL and Qwen2.5-VL through a Vulkan-compiled llama.cpp build. The thread highlights how fragile local multimodal inference still is: a matched GGUF and mmproj file pair, an up-to-date llama.cpp build, and backend-specific vision support all have to line up before captioning works.

// ANALYSIS

Qwen3-VL may be broadly supported in llama.cpp now, but “supported” still does not mean painless across every GPU backend.

  • Vulkan remains a rougher path than CUDA or Metal for multimodal workloads, especially on edge cases involving vision encoders.
  • Empty captions usually suggest the vision side is not actually being wired in, often because the mmproj file is missing, mismatched, or not loaded correctly.
  • Qwen2.5-VL failing too makes this look less like a single-model issue and more like a local setup, prompt format, or backend support problem.
  • For developers, the practical test is simple: verify the same model and mmproj on CPU or CUDA first, then isolate Vulkan-specific failures.
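The isolation test in the last bullet can be sketched as a pair of CLI runs against llama.cpp's multimodal tool. The model, mmproj, and image filenames below are hypothetical placeholders; the general shape (`-m` plus `--mmproj`, with `-ngl 0` forcing a CPU-only baseline before offloading layers to the Vulkan device) follows llama.cpp's usual conventions, and exact flags may vary by build:

```shell
# Hypothetical filenames -- substitute the actual GGUF pair you downloaded.
# The mmproj must come from the same model release as the main GGUF.
MODEL=Qwen3-VL-8B-Instruct-Q4_K_M.gguf
MMPROJ=mmproj-Qwen3-VL-8B-Instruct-f16.gguf

if command -v llama-mtmd-cli >/dev/null 2>&1; then
  # 1) CPU-only baseline: -ngl 0 keeps all layers off the GPU, so a good
  #    caption here confirms the model/mmproj pairing and prompt are sound.
  llama-mtmd-cli -m "$MODEL" --mmproj "$MMPROJ" -ngl 0 \
    --image test.jpg -p "Describe this image."

  # 2) Same invocation on the Vulkan build with layers offloaded; if only
  #    this run returns empty output, the failure is Vulkan-specific.
  llama-mtmd-cli -m "$MODEL" --mmproj "$MMPROJ" -ngl 99 \
    --image test.jpg -p "Describe this image."
fi
```

If both runs fail identically, the problem is more likely the mmproj pairing or prompt format than the backend; if only the offloaded run fails, that points at the Vulkan vision path.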
// TAGS
qwen3-vl · qwen2.5-vl · llama.cpp · multimodal · inference · gpu · open-weights

DISCOVERED

7h ago

2026-04-22

PUBLISHED

10h ago

2026-04-22

RELEVANCE

7/10

AUTHOR

WorldlinessTime634