OPEN_SOURCE ↗
REDDIT · 28d ago · TUTORIAL
Fish Audio S2 Pro streams at 380ms TTFA
A community developer has implemented end-to-end streaming for Fish Audio's S2 Pro TTS model, cutting time-to-first-audio (TTFA) to ~380ms with torch.compile on an RTX 5090, down from ~800ms without it. The working code is shared as a GitHub PR for others to build on.
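For anyone reproducing the TTFA figures, here is a minimal sketch of how time-to-first-audio can be measured against any chunk-yielding stream. It assumes the streaming call is exposed as a Python iterable of audio byte chunks; `measure_ttfa` and `dummy_stream` are hypothetical names for illustration, not part of the Fish Audio code or the PR.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfa(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (ttfa_seconds, total_seconds, n_chunks) for a chunk stream."""
    start = time.perf_counter()
    ttfa = None
    n_chunks = 0
    for _chunk in stream:
        if ttfa is None:
            # The first chunk has arrived: this interval is the TTFA.
            ttfa = time.perf_counter() - start
        n_chunks += 1
    total = time.perf_counter() - start
    if ttfa is None:
        raise ValueError("stream produced no audio chunks")
    return ttfa, total, n_chunks


def dummy_stream(n_chunks: int, delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call (assumption)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulates per-chunk generation latency
        yield b"\x00" * 320  # placeholder audio bytes


if __name__ == "__main__":
    ttfa, total, n = measure_ttfa(dummy_stream(5, 0.02))
    print(f"TTFA {ttfa * 1000:.0f} ms, total {total * 1000:.0f} ms, {n} chunks")
```

When timing a compiled model, the first call should probably be excluded or reported separately, since it includes compilation time rather than steady-state latency.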
// ANALYSIS
Getting a 4B-parameter dual-autoregressive TTS model to stream at sub-400ms TTFA on consumer hardware is a meaningful unlock for local voice AI deployments.
- S2 Pro uses a Slow AR (4B) + Fast AR (400M) architecture, which makes streaming non-trivial: the vocoder (DAC) and the LLM must be pipelined carefully so that neither blocks the other
- torch.compile cuts TTFA roughly in half (800ms → 380ms), but recompiling for new input shapes takes ~6 minutes, a known pain point that is fixable but currently blocks production use
- The contributor notes that DAC can process tokens independently given smarter scheduling, which could lower TTFA further because decoding would not wait for a full LLM pass
- Key open issues: OOM on longer prompts (30–50 words), memory optimization, and CUDA graph integration; all are solvable, but they require ML-level profiling
- This is a community contribution, not an official Fish Audio release, but the maintainer provided guidance, a signal that the team welcomes this direction
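The pipelining idea in the bullets above can be sketched as a producer/consumer pair: one thread generates tokens and hands them off in small chunks, while the consumer decodes each chunk as soon as it arrives instead of waiting for the full pass. This is an illustrative sketch only, not the PR's implementation; `fake_token_generator`, `fake_decode`, and `CHUNK_TOKENS` are hypothetical stand-ins for the autoregressive stage, the DAC decoder, and the chunk size.

```python
import queue
import threading
from typing import Iterator, List

CHUNK_TOKENS = 4  # hypothetical: tokens handed to the decoder at a time
_DONE = object()  # sentinel marking the end of the token stream


def fake_token_generator(n_tokens: int) -> Iterator[int]:
    """Stand-in for the autoregressive stage: yields one token at a time."""
    for i in range(n_tokens):
        yield i


def fake_decode(tokens: List[int]) -> bytes:
    """Stand-in for the codec decoder: maps a token chunk to audio bytes."""
    return bytes(t % 256 for t in tokens)


def producer(n_tokens: int, out_q: "queue.Queue") -> None:
    """Group tokens into chunks and hand each off as soon as it is full."""
    chunk: List[int] = []
    for tok in fake_token_generator(n_tokens):
        chunk.append(tok)
        if len(chunk) == CHUNK_TOKENS:
            out_q.put(chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        out_q.put(chunk)
    out_q.put(_DONE)


def stream_audio(n_tokens: int) -> Iterator[bytes]:
    """Yield decoded audio chunks while generation continues in a thread."""
    # Bounded queue, so the producer cannot run far ahead of the decoder.
    q: "queue.Queue" = queue.Queue(maxsize=8)
    worker = threading.Thread(target=producer, args=(n_tokens, q), daemon=True)
    worker.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        yield fake_decode(item)
    worker.join()


if __name__ == "__main__":
    for i, audio in enumerate(stream_audio(10)):
        print(f"chunk {i}: {len(audio)} bytes")
```

Smaller chunks should reduce time to the first decoded audio at the cost of more hand-offs; on a GPU the two stages would also contend for the same device, so the real scheduling problem is harder than this CPU-only sketch suggests.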
// TAGS
fish-audio-s2-pro · speech · audio-gen · open-source · inference
DISCOVERED
28d ago
2026-03-15
PUBLISHED
28d ago
2026-03-15
RELEVANCE
7/10
AUTHOR
konovalov-nk