REDDIT // 29d ago · OPEN-SOURCE RELEASE
Fish Audio open-sources S2, tops TTS benchmarks
Fish Audio has open-sourced S2, a 4B-parameter dual-autoregressive TTS model that beats ElevenLabs, Seed-TTS, and MiniMax on the Audio Turing Test and EmergentTTS-Eval. It supports inline natural-language emotion tags, zero-shot voice cloning, multi-speaker dialogue generation, and 80+ languages, with a 100 ms time-to-first-audio.
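To make the inline tag format concrete, here is a minimal Python sketch that splits a script containing bracketed tags such as `[whispers sweetly]` into (tag, speech) pairs. It is not Fish Audio code: `split_emotion_tags` is a hypothetical helper, and the rule that a tag applies to the text following it is an assumption made for illustration. The model itself reads the tags inline, so a preprocessing step like this is only useful for inspecting or validating scripts.

```python
import re
from typing import List, Tuple

# Hypothetical helper for illustration; not part of any Fish Audio API.
# Matches one bracketed tag with no nested brackets, e.g. "[whispers sweetly]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_emotion_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into (tag, speech) pairs.

    Assumption: a tag governs the speech that follows it, up to the next tag.
    Speech before the first tag is paired with an empty tag.
    """
    segments: List[Tuple[str, str]] = []
    current_tag = ""
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        speech = text[pos:match.start()].strip()
        if speech:
            segments.append((current_tag, speech))
        current_tag = match.group(1).strip()
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


script = "Hello. [whispers sweetly] Come closer. [laughing nervously] Just kidding!"
for tag, speech in split_emotion_tags(script):
    print(f"{tag or 'neutral'}: {speech}")
```

Because the tags are freeform, the sketch accepts any bracketed phrase instead of checking it against a fixed list of emotions.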
// ANALYSIS
S2 is the most credible open-source TTS challenger to ElevenLabs yet, and its inline freeform emotion tagging (`[whispers sweetly]`, `[laughing nervously]`) is a genuine UX leap over rigid fixed-schema alternatives.
- Beats every evaluated closed-source model on Seed-TTS WER (0.54% Chinese, 0.99% English) and EmergentTTS-Eval (81.88% win rate vs. gpt-4o-mini-tts)
- Dual-AR architecture (Qwen3-4B slow AR + 400M fast AR) maps directly onto the LLM serving stack; SGLang integration means production streaming without custom infra
- Zero-shot voice cloning from 10–30 second clips with 86.4% RadixAttention cache hit rates makes it practical for high-volume API workloads
- Caveat: the Research License bars commercial use without a separate Fish Audio agreement, so it is not truly permissive open source, which disappointed parts of the community
- Hardware ceiling is real: 12 GB VRAM minimum, real-time RTF only on H200-class GPUs; consumer setups will struggle
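For readers unfamiliar with the RTF figure in the hardware note, the real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced, with values below 1.0 meaning faster than real time. A small Python sketch of that arithmetic follows; the function names and sample timings are illustrative assumptions, not measurements of S2.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """Generation keeps up with playback when RTF is below 1.0."""
    return rtf < 1.0


# Illustrative numbers only (not S2 benchmarks): 10 s of audio generated in
# 2 s on a fast GPU versus 15 s on a slower one.
fast = real_time_factor(2.0, 10.0)
slow = real_time_factor(15.0, 10.0)
print(f"fast GPU: RTF={fast:.2f}, real-time={is_real_time(fast)}")
print(f"slow GPU: RTF={slow:.2f}, real-time={is_real_time(slow)}")
```

Timing a fixed batch of prompts on your own hardware and applying this ratio is a quick way to check which side of 1.0 a given setup falls on.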
// TAGS
fish-audio-s2 · speech · audio-gen · open-source · open-weights · fine-tuning · benchmark
DISCOVERED
29d ago
2026-03-14
PUBLISHED
32d ago
2026-03-11
RELEVANCE
8/10
AUTHOR
Hillvegxn