Wan-Streamer launches real-time multimodal interaction
Wan-AI releases Wan-Streamer v0.1, a single-Transformer foundation model built from the ground up for low-latency, full-duplex audio-visual communication. By integrating perception, reasoning, and synthesis in one model, it reports roughly 200 ms of model-side latency and supports fluid 25 fps interaction without the delays of a cascaded pipeline.
Cascaded voice-agent pipelines look dated: Wan-Streamer makes the case that end-to-end native streaming is the viable path to real-time, human-like AI interaction. By interleaving audio and video tokens in a single Transformer, it targets the latency and error accumulation that plague traditional multi-step systems.
- Unified Architecture: Eliminates separate VAD, ASR, LLM, TTS, and video generation steps by training all modalities inside a single Transformer model.
- Low-latency Streaming: Redesigns the stack around block-causal attention and streaming token scheduling, delivering ~200 ms model-side response latency.
- High-frequency Video: Streams visual and auditory modalities at 25 fps, with streaming units as short as 160 ms.
- Native Cross-modal Sync: Learns turn management and multimodal coordination end-to-end rather than relying on hand-engineered rules across cascaded API blocks.
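
To make the block-causal attention and the 160 ms streaming unit above more concrete, here is a minimal NumPy sketch. It is a generic illustration of block-causal masking, not Wan-Streamer's actual implementation: the function names, the block size, and the assumption that one block corresponds to one streaming unit are all hypothetical. Only the 25 fps and 160 ms figures come from the announcement.

```python
import numpy as np

FPS = 25       # video frame rate quoted in the announcement
UNIT_MS = 160  # shortest streaming unit quoted in the announcement


def frames_per_unit(fps: int = FPS, unit_ms: int = UNIT_MS) -> int:
    """Number of video frames covered by one streaming unit."""
    return fps * unit_ms // 1000


def block_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Build a boolean attention mask of shape (T, T), T = num_blocks * block_size.

    Entry [q, k] is True when query token q may attend to key token k.
    A token sees every token in its own block and in all earlier blocks,
    and nothing from later blocks. (Hypothetical helper for illustration.)
    """
    # Label each token position with the index of the block it belongs to.
    block_ids = np.repeat(np.arange(num_blocks), block_size)
    # Allow attention whenever the key's block is not later than the query's.
    return block_ids[:, None] >= block_ids[None, :]


if __name__ == "__main__":
    print(frames_per_unit())                      # frames per 160 ms unit at 25 fps
    print(block_causal_mask(3, 2).astype(int))    # 3 blocks of 2 tokens each
```

At 25 fps a 160 ms unit spans 4 frames. With 3 blocks of 2 tokens, the first two rows of the mask allow attention to 2 positions, the next two to 4, and the last two to all 6. Attention is therefore full within a block and causal across blocks, which is what lets a model emit output one unit at a time without waiting for future input.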
DISCOVERED
2026-06-25
PUBLISHED
2026-06-25
AUTHOR
_akhaliq
