Fish Audio S2 Pro streams at 380ms TTFA
REDDIT // 28d ago // TUTORIAL

A community developer has implemented end-to-end streaming for Fish Audio's S2 Pro TTS model, cutting time-to-first-audio (TTFA) to ~380ms with torch.compile on an RTX 5090, down from ~800ms without it. The working code is shared as a GitHub PR for others to build on.
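
TTFA here means the delay between issuing a request and receiving the first audio chunk. A minimal sketch of how such a number could be measured for any chunk-yielding stream follows. The names `measure_ttfa` and `fake_tts_stream` are hypothetical and not taken from the PR, and the fake stream only sleeps; it does not run S2 Pro.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfa(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (ttfa_seconds, total_seconds, chunk_count) for one streamed request."""
    start = time.perf_counter()  # clock starts before the first chunk is requested
    ttfa = None
    chunks = 0
    for chunk in stream:
        if ttfa is None and chunk:
            # First non-empty chunk: this is the time-to-first-audio.
            ttfa = time.perf_counter() - start
        chunks += 1
    total = time.perf_counter() - start
    if ttfa is None:
        raise ValueError("stream produced no audio")
    return ttfa, total, chunks


def fake_tts_stream(first_delay: float, chunk_delay: float, n_chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; sleeps instead of running a model."""
    time.sleep(first_delay)  # simulated prefill + first decode
    yield b"\x00" * 1024
    for _ in range(n_chunks - 1):
        time.sleep(chunk_delay)  # simulated steady-state chunk interval
        yield b"\x00" * 1024


if __name__ == "__main__":
    ttfa, total, n = measure_ttfa(fake_tts_stream(0.05, 0.01, 5))
    print(f"TTFA {ttfa * 1000:.0f} ms, total {total * 1000:.0f} ms, {n} chunks")
```

When timing a real model, compiled paths are normally warmed up with a throwaway request first; otherwise the first measurement includes compilation time rather than steady-state latency.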

// ANALYSIS

Getting a 4B-parameter dual-autoregressive TTS model to stream at sub-400ms TTFA on consumer hardware is a meaningful unlock for local voice AI deployments.

  • FishSpeech S2 Pro uses a Slow AR (4B) + Fast AR (400M) architecture, making streaming non-trivial — the vocoder (DAC) and LLM must be pipelined carefully to avoid blocking
  • torch.compile cuts TTFA roughly in half (800ms → 380ms), but recompilation on new input shapes takes ~6 minutes, a fixable issue that currently blocks production use
  • The contributor notes DAC can process tokens independently with smarter scheduling, which could drop TTFA further without waiting for a full LLM pass
  • Key open issues: OOM on longer prompts (30–50 words), memory optimization, and CUDA graph integration; all are solvable, but each requires ML-level profiling
  • This is a community contribution, not an official Fish Audio release, but the maintainer provided guidance — a signal the team welcomes this direction
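
The pipelining idea in the bullets above (decode audio chunk by chunk while token generation is still running) can be sketched with a bounded queue between a generator thread and a decoder loop. This is an illustrative sketch only, not the PR's implementation: `generate_tokens` and `decode_chunk` are hypothetical stand-ins for the autoregressive loop and the DAC vocoder, and `CHUNK_TOKENS = 8` is an assumed value.

```python
import queue
import threading
from typing import Iterator, List

CHUNK_TOKENS = 8   # assumed chunk size; not taken from the PR
_DONE = object()   # sentinel marking the end of generation


def generate_tokens(n: int) -> Iterator[int]:
    """Hypothetical stand-in for the autoregressive token loop."""
    for i in range(n):
        yield i


def decode_chunk(tokens: List[int]) -> bytes:
    """Hypothetical stand-in for the vocoder: one byte of 'audio' per token."""
    return bytes(t % 256 for t in tokens)


def producer(n_tokens: int, out: queue.Queue) -> None:
    """Group tokens into fixed-size chunks and hand each one off as soon as it is full."""
    buf: List[int] = []
    try:
        for tok in generate_tokens(n_tokens):
            buf.append(tok)
            if len(buf) == CHUNK_TOKENS:
                out.put(buf)
                buf = []
        if buf:
            out.put(buf)  # flush the final partial chunk
    finally:
        out.put(_DONE)


def stream_audio(n_tokens: int) -> Iterator[bytes]:
    """Yield decoded audio chunks while generation continues in the background."""
    q: queue.Queue = queue.Queue(maxsize=4)  # bounded so generation cannot run far ahead
    worker = threading.Thread(target=producer, args=(n_tokens, q), daemon=True)
    worker.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        yield decode_chunk(item)
    worker.join()


if __name__ == "__main__":
    for i, audio in enumerate(stream_audio(20)):
        print(f"chunk {i}: {len(audio)} bytes")
```

In this sketch the first audio is available after `CHUNK_TOKENS` tokens rather than after the full sequence, so chunk size trades TTFA against per-chunk decode overhead. A real GPU pipeline would more likely overlap the two stages with CUDA streams than with Python threads.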

// TAGS
fish-audio-s2-pro · speech · audio-gen · open-source · inference

DISCOVERED: 28d ago (2026-03-15)
PUBLISHED: 28d ago (2026-03-15)
RELEVANCE: 7/10
AUTHOR: konovalov-nk