Docker repo optimizes Qwen 3.5 Vision local inference
A developer shared practical insights for running Qwen 3.5 Vision locally on vLLM and llama.cpp, highlighting solutions for long-video OOM errors and preprocessing speedups. The accompanying open-source repository provides Docker Compose profiles and a testing app for experimenting with 0.8B to 122B models.
Running vision models locally remains tricky, but community-driven optimizations like manual preprocessing and intelligent video chunking make it viable even on constrained hardware. Downsampling videos to 1 FPS and 360px before passing them to vLLM halves inference latency compared to native engine processing. Long-context vision tasks easily hit VRAM limits, necessitating application-level video chunking (≤300s) with 2-10s overlaps to preserve context. The 4B model struggles with JSON generation, making structured output libraries like Instructor mandatory for reliable data pipelines. Stable vLLM builds surprisingly outperformed nightly versions on newer Blackwell GPUs, emphasizing the need for hardware-specific testing.
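The downsampling step above is typically done with ffmpeg before the video ever reaches the inference engine. A minimal sketch of building such a command, assuming an ffmpeg binary and hypothetical input/output filenames (the article does not show the exact invocation):

```python
import shlex

def build_downsample_cmd(src: str, dst: str, fps: int = 1, short_side: int = 360) -> list[str]:
    """Build an ffmpeg command that resamples a video to `fps` frames/sec
    and scales it down, preserving aspect ratio."""
    # scale=-2:N sets the height to N px and picks an even width automatically
    # (many codecs require even dimensions); for landscape video the height is
    # the short side. -an drops the audio track, which vision models ignore.
    vf = f"fps={fps},scale=-2:{short_side}"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-an", dst]

cmd = build_downsample_cmd("talk.mp4", "talk_1fps_360p.mp4")
print(shlex.join(cmd))
```

Doing this at the application layer means the engine only ever tokenizes the frames it will actually use, which is where the claimed latency savings come from.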
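The chunking logic is simple enough to sketch directly: split the timeline into windows of at most 300 s, with each window starting a few seconds before the previous one ended so events at the boundary appear in both chunks. The 5 s overlap below is one point inside the article's 2-10 s range:

```python
def chunk_spans(duration_s: float, max_chunk_s: float = 300.0, overlap_s: float = 5.0):
    """Split [0, duration_s] into (start, end) spans of at most `max_chunk_s`
    seconds; each span after the first begins `overlap_s` seconds before the
    previous span's end so context carries across chunk boundaries."""
    assert 0 < overlap_s < max_chunk_s
    spans, start = [], 0.0
    while True:
        end = min(start + max_chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            return spans
        start = end - overlap_s

print(chunk_spans(650.0))  # [(0.0, 300.0), (295.0, 595.0), (590.0, 650.0)]
```

Each span is then cut (or seeked) out of the source and sent as an independent request, keeping any single request's frame count within VRAM limits.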
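Instructor's core value is automating a validate-and-retry loop around the model call (with pydantic validation and error re-prompting). A stdlib-only sketch of that underlying pattern, with a stubbed model standing in for a small VLM that sometimes emits malformed JSON:

```python
import json

def extract_json(text: str, required_keys: set[str]):
    """Parse `text` as JSON and check it is an object with the required keys.
    Returns the dict on success, None on any failure."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not required_keys <= obj.keys():
        return None
    return obj

def ask_with_retries(model_call, required_keys: set[str], max_tries: int = 3):
    """Call the model up to `max_tries` times until it returns valid JSON
    with the required keys. Instructor automates this loop, additionally
    feeding the validation error back into the next prompt."""
    for _ in range(max_tries):
        obj = extract_json(model_call(), required_keys)
        if obj is not None:
            return obj
    raise ValueError("model never produced valid structured output")

# Stub for a flaky small model: fails once, then emits valid JSON.
replies = iter(['Sure! Here is the JSON: {...', '{"label": "cat", "score": 0.9}'])
print(ask_with_retries(lambda: next(replies), {"label", "score"}))
```

With a 4B model the first attempt failing is common enough that wiring this in (or using Instructor directly) is the difference between a usable pipeline and one that silently drops records.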
DISCOVERED: 2026-04-01 (10d ago)
PUBLISHED: 2026-04-01 (10d ago)
AUTHOR: FantasticNature7590