OPEN_SOURCE ↗
GH · GITHUB // 14d ago // OPEN-SOURCE RELEASE
Microsoft releases VibeVoice for long-form audio synthesis
Microsoft released VibeVoice, an open-source speech framework capable of processing and synthesizing up to 90 minutes of continuous, multi-speaker audio in a single pass. The system leverages continuous speech tokenizers and a next-token diffusion framework to achieve high-fidelity output on consumer-grade hardware.
// ANALYSIS
VibeVoice is a major milestone for open-source speech AI, offering a locally runnable alternative to proprietary giants like ElevenLabs.
- 3200x audio compression via novel continuous speech tokenizers enables long-form generation on hardware with as little as 8GB VRAM.
- Native support for four distinct speakers with natural turn-taking simplifies the automation of podcasting and meeting transcription workflows.
- While internal benchmarks claim parity with ElevenLabs V3, real-world users report occasional stability issues and "hallucinated" background artifacts.
- The model's cross-lingual capabilities and low-latency realtime variant (0.5B) make it highly versatile for interactive agent applications.
- Microsoft's decision to restrict TTS source code shortly after release highlights the ongoing friction between open research and deepfake safety concerns.
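To make the 3200x figure concrete, here is a rough back-of-the-envelope sketch of how many tokenizer frames a 90-minute session would produce. It assumes a 24 kHz input sample rate and reads "3200x" as the ratio of raw audio samples to tokenizer frames; neither assumption is stated in this entry, and the function names are hypothetical.

```python
# Rough sequence-length estimate for long-form audio.
# ASSUMPTIONS (not from this entry): 24 kHz input audio, and "3200x" meaning
# raw samples per tokenizer frame. Function names are illustrative only.

SAMPLE_RATE_HZ = 24_000   # assumed input sample rate
COMPRESSION = 3200        # compression factor quoted in the entry
MINUTES = 90              # maximum session length quoted in the entry


def frames_per_second(sample_rate_hz: int, compression: int) -> float:
    """Tokenizer frames emitted per second of audio."""
    return sample_rate_hz / compression


def frames_for(minutes: float,
               sample_rate_hz: int = SAMPLE_RATE_HZ,
               compression: int = COMPRESSION) -> int:
    """Total tokenizer frames for a session of the given length."""
    return int(minutes * 60 * frames_per_second(sample_rate_hz, compression))


if __name__ == "__main__":
    print(frames_per_second(SAMPLE_RATE_HZ, COMPRESSION))  # 7.5
    print(frames_for(MINUTES))                             # 40500
```

Under these assumptions a full 90-minute session is on the order of tens of thousands of frames, which would plausibly fit in a single language-model context window.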
// TAGS
vibevoice · microsoft · speech · audio-gen · open-source · llm
DISCOVERED · 2026-03-29 (14d ago)
PUBLISHED · 2026-03-29 (14d ago)
RELEVANCE · 9/10