OPEN_SOURCE
MODEL RELEASE
Inworld Realtime TTS-2 adds voice direction
Inworld’s research-preview voice model is built for realtime conversation: it conditions on prior audio and plain-English voice direction rather than treating each line as isolated text-to-speech. It also keeps a single speaker identity across 100+ languages and ships through the Inworld API and Realtime API.
// ANALYSIS
This is the right direction for voice AI: less “pretty TTS,” more controllable conversational state. The real shift is from static narration to a model that adapts to how the user actually sounds and how the developer wants the line delivered.
- Prior audio context lets the model carry tone, pacing, and emotional state across turns instead of only reading the current sentence
- Plain-English voice direction lowers the control burden; devs can steer delivery without wrestling with fixed emotion enums
- Crosslingual identity preservation matters for support, games, and companion apps that need one persona across markets
- The research-preview rollout makes this feel like a production API move, not just a demo tied to one flagship voice
- The main proof point now is reliability: latency, pronunciation edge cases, and expressive consistency in real apps
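The interaction model described above, prior-audio conditioning plus free-form voice direction, can be sketched as a request payload. This is a hypothetical shape for illustration only: the endpoint, field names (`direction`, `context_audio`), and helper are assumptions, not Inworld's actual API schema.

```python
import base64

def build_tts_request(text, direction, prior_audio_bytes=None,
                      voice_id="default", language="en"):
    """Assemble a hypothetical realtime-TTS request payload.

    `direction` is a plain-English delivery instruction (e.g.
    "whisper, slightly amused"); `prior_audio_bytes` carries the
    previous conversational turn so the model can condition on tone
    and pacing. All field names here are illustrative.
    """
    payload = {
        "text": text,
        "voice_id": voice_id,    # one identity, reused across languages
        "language": language,
        "direction": direction,  # free-form style control, no emotion enum
    }
    if prior_audio_bytes is not None:
        # Prior-turn audio, base64-encoded for JSON transport
        payload["context_audio"] = base64.b64encode(prior_audio_bytes).decode("ascii")
    return payload

req = build_tts_request(
    "Of course, let me check that for you.",
    direction="calm and apologetic, slower pace",
    prior_audio_bytes=b"\x00\x01",  # stand-in for the user's last turn
)
```

The point of the sketch is the control surface: delivery is steered by a natural-language string and by conversational context, not by picking from a fixed list of emotions.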
// TAGS
inworld-tts-2 · tts · speech · voice-agent · agent · api · infrastructure
DISCOVERED
2026-05-06
PUBLISHED
2026-05-06
RELEVANCE
8/10
AUTHOR
inworld_ai