Fish Audio open-sources S2, an expressive TTS model
Fish Audio has open-sourced S2, a new text-to-speech model with natural-language emotion control, native multi-speaker generation, and a production-minded streaming stack built on SGLang. It stands out because Fish shipped a full developer package (model weights, fine-tuning code, API access, benchmarks, and self-hosting paths) instead of just a flashy demo.
This is one of the more serious voice AI releases of the year: Fish Audio is pitching S2 as a stack that can actually be deployed, not just a consumer voice toy. The caveat is that free use is covered by a research license, so teams need to check the commercial terms before treating it as a fully permissive open-source drop-in.
- Natural-language inline tags like [whisper], [laugh], and custom prosody cues make S2 much more steerable than TTS systems built around fixed emotion presets
- Fish is leaning hard on performance as a differentiator, claiming roughly 100 ms time-to-first-audio and strong streaming throughput on H200 hardware
- The company says S2 posts leading results on Seed-TTS Eval and EmergentTTS-Eval, beating Seed-TTS, MiniMax Speech, and a GPT-4o-mini-tts baseline on multiple measures
- Shipping the GitHub repo, Hugging Face weights, blog writeup, and hosted API together gives developers multiple adoption paths: experiment locally, self-host, or just call the service
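
To make the inline-tag idea concrete, here is a minimal Python sketch of how a script with bracketed cues such as [whisper] and [laugh] could be inspected before it is sent to a TTS model. It is an illustration only, not Fish Audio's parser or API: the `Segment` class, the `split_inline_cues` function, and the rule that a cue applies to the text that follows it are all assumptions made for this example. The model itself presumably consumes the raw tagged text.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical pattern: any non-empty run of characters inside one pair of
# square brackets is treated as a cue, e.g. "[whisper]" or "[slow, tired]".
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


@dataclass
class Segment:
    """A span of spoken text plus the cues that immediately precede it."""
    cues: list[str]
    text: str


def split_inline_cues(script: str) -> list[Segment]:
    """Split a tagged script into (cues, text) segments.

    Assumption for this sketch: a cue attaches to the text that follows it,
    and consecutive cues accumulate onto the same segment.
    """
    segments: list[Segment] = []
    pending: list[str] = []
    pos = 0
    for match in CUE_PATTERN.finditer(script):
        text = script[pos:match.start()].strip()
        if text:
            # Close the current segment before starting a new cue group.
            segments.append(Segment(cues=pending, text=text))
            pending = []
        pending.append(match.group(1).strip())
        pos = match.end()
    tail = script[pos:].strip()
    if tail or pending:
        # Keep trailing cues (e.g. a final "[laugh]") as a text-less segment.
        segments.append(Segment(cues=pending, text=tail))
    return segments


if __name__ == "__main__":
    demo = "[whisper] Don't wake them. [laugh] Too late."
    for segment in split_inline_cues(demo):
        print(segment.cues, repr(segment.text))
```

A pre-flight pass like this can help catch unbalanced brackets or unexpected cue names in a script, whichever backend ends up rendering the audio.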
Discovered: 2026-03-11
Published: 2026-03-10
Author: [REDACTED]