OPEN_SOURCE
REDDIT // 12d ago · MODEL RELEASE
Qwen3.5 Omni Plus lands with voice, video
Alibaba's Qwen3.5-Omni is a native multimodal model for text, image, audio, and video, now exposed as Plus, Flash, and Light APIs. The launch claims 100M+ hours of audio-visual training data, 113-language speech recognition, 36-language speech generation, and workflows like video captioning and audio-visual vibe coding.
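The launch post doesn't spell out the request shape, but if the Plus/Flash/Light tiers follow the OpenAI-compatible pattern common for hosted Qwen models, a video-captioning call might look like the sketch below. The base URL, model id, and `video_url` content part are all assumptions, not confirmed API details:

```python
# Hypothetical sketch of a video-captioning request, assuming the tiers
# sit behind an OpenAI-compatible chat endpoint. Base URL, model id, and
# the "video_url" content part are assumptions, not documented API facts.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                        # assumed credential setup
    base_url="https://example.com/compatible/v1",  # placeholder endpoint
)

resp = client.chat.completions.create(
    model="qwen3.5-omni-flash",  # assumed tier id; Plus/Light would be analogous
    messages=[
        {
            "role": "user",
            "content": [
                # Assumed multimodal content part for video input
                {"type": "video_url",
                 "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text",
                 "text": "Caption this clip scene by scene, labeling each speaker."},
            ],
        }
    ],
)
print(resp.choices[0].message.content)
```

If that shape holds, swapping `qwen3.5-omni-flash` for the Plus or Light model id would be the only change needed to trade cost against latency.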
// ANALYSIS
This feels like Qwen trying to collapse the whole voice/video stack into one model contract instead of another stitched-together demo. If the latency and reliability hold up outside the launch video, the real win is developer ergonomics more than leaderboard theater.
- Native text, image, audio, and video handling should cut a lot of glue code for teams building voice, media, and agent workflows
- The 113-language recognition and 36-language speech generation claims matter more for real products than raw benchmark bragging rights
- Audio-visual vibe coding is the flashy hook, but it needs independent developer testing before anyone bets on it
- Script-level video captioning, scene cuts, and speaker mapping look quietly useful for creators, editors, and support teams
- Plus/Flash/Light packaging suggests Alibaba wants this to fit production deployments at different cost and latency points
// TAGS
qwen3.5-omni · multimodal · speech · audio · video · agent · search
DISCOVERED
2026-03-30 (12d ago)
PUBLISHED
2026-03-30 (12d ago)
RELEVANCE
9/10
AUTHOR
Lopsided_Dot_4557