Scenema Audio drops open-source emotional voice cloning
Scenema Audio is a new open-source, zero-shot voice cloning model that decouples voice identity from emotional performance. Built on Gemma 3 and an LTX diffusion transformer, it uses XML-style stage directions to make any cloned voice perform complex emotions like rage or grief alongside scene-aware background audio.
Scenema shifts the TTS focus from phonetic accuracy to acting, targeting the persistent "robotic" feel of most open-source audio generators. Because identity is decoupled from emotion, you can clone a flat 10-second reference clip and make that voice scream, whisper, or cry. XML-based action tags give creators fine-grained control over mid-sentence emotional shifts, pacing, and breathing. Because speech and ambient environmental audio are co-generated in a single pass, audio-first video generation workflows no longer need a separate step for background sound. The 16GB VRAM requirement puts this high-fidelity, performative audio within reach of developers on consumer hardware.
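To make the idea of XML-style stage directions concrete, here is a minimal sketch of how a tagged script could be split into per-direction segments before synthesis. The tag names (`whisper`, `rage`), the `neutral` default label, and the `split_performance` helper are all hypothetical; the source does not specify Scenema Audio's actual tag vocabulary or API, so treat this as an illustration of the concept, not the project's real interface.

```python
import xml.etree.ElementTree as ET


def split_performance(script: str) -> list[tuple[str, str]]:
    """Split a tagged script into (direction, text) pairs in reading order.

    Assumption: directions are flat, non-nested XML elements, and untagged
    text is delivered with a hypothetical "neutral" direction.
    """
    # Wrap the script in a root element so mixed text and tags parse as XML.
    root = ET.fromstring(f"<scene>{script}</scene>")
    segments = []
    # Text before the first tag.
    if root.text and root.text.strip():
        segments.append(("neutral", root.text.strip()))
    for node in root:
        # Text inside the tag carries that tag's direction.
        if node.text and node.text.strip():
            segments.append((node.tag, node.text.strip()))
        # Text after the closing tag reverts to the default direction.
        if node.tail and node.tail.strip():
            segments.append(("neutral", node.tail.strip()))
    return segments


# Hypothetical script with a mid-sentence emotional shift.
script = "I told you to leave. <whisper>Please.</whisper> <rage>Get out!</rage> Now."
for direction, text in split_performance(script):
    print(f"{direction:>8}: {text}")
```

Each segment could then be passed to the model together with the same cloned-voice reference, which is what decoupling identity from performance would allow in principle. Nested directions are deliberately out of scope for this sketch.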
DISCOVERED: 2026-05-17
PUBLISHED: 2026-05-17
AUTHOR: AI Search