LongCat-AudioDiT lands with waveform latent TTS
OPEN_SOURCE
REDDIT // 12d ago · MODEL RELEASE


LongCat-AudioDiT is Meituan LongCat’s open diffusion TTS model that generates speech directly in waveform latent space rather than through mel-spectrograms. The 3.5B variant claims state-of-the-art zero-shot voice cloning on the Seed benchmark, with weights and inference code released on GitHub and Hugging Face.
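To make the pipeline shape concrete, here is a toy numpy skeleton of the waveform-latent approach described above: a Wav-VAE compresses raw audio into latent frames, a diffusion model denoises those latents under text conditioning, and the VAE decoder returns a waveform with no mel-spectrogram stage in between. All names (`WavVAE`, `synthesize`) and shapes are illustrative stand-ins, not LongCat’s actual API.

```python
import numpy as np

class WavVAE:
    """Toy stand-in for a waveform VAE: maps audio <-> latent frames.

    Hypothetical hop size and latent dim; the real model's values
    are not stated in the summary above.
    """
    def __init__(self, hop=320, dim=64):
        self.hop, self.dim = hop, dim

    def encode(self, wav):
        # One latent frame per `hop` samples (placeholder latents).
        n = len(wav) // self.hop
        return np.random.randn(n, self.dim)

    def decode(self, z):
        # Placeholder decoder: returns a silent waveform of the
        # right length, one hop of samples per latent frame.
        return np.zeros(z.shape[0] * self.hop)

def synthesize(text_cond, vae, frames=100, steps=10):
    """Sketch of diffusion sampling in waveform-latent space.

    `text_cond` would condition the denoiser in a real model;
    here the denoising step is a trivial stand-in.
    """
    z = np.random.randn(frames, vae.dim)  # start from pure noise
    for _ in range(steps):
        z = z - 0.1 * z  # stand-in for one conditioned denoising step
    return vae.decode(z)  # waveform out, no mel intermediate
```

The point of the sketch is the data flow: text conditioning and the diffusion loop operate on VAE latents of the waveform itself, so the vocoder-style mel-to-audio stage disappears.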

// ANALYSIS

This is a meaningful speech-model release because it attacks TTS complexity at the representation level, not just by scaling up another pipeline.

  • Direct waveform-latent generation removes the mel bottleneck and should reduce compounding errors in long-form synthesis.
  • The APG guidance swap and training-inference mismatch fix look like the real quality wins, not just parameter count.
  • The 3.5B model’s Seed gains are modest but credible, especially for zero-shot cloning where speaker similarity matters a lot.
  • The paper’s Wav-VAE finding is important: better reconstructions do not automatically translate to better end-to-end TTS.
  • Open weights plus a Hugging Face-compatible implementation make this immediately useful for downstream speech tooling and fine-tuning experiments.
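On the guidance point above: assuming APG here means adaptive projected guidance, the idea is to replace classifier-free guidance’s raw update, `cond + (scale - 1) * (cond - uncond)`, with one that splits the guidance vector into components parallel and orthogonal to the conditional prediction and downweights the parallel part, which is associated with oversaturation at high guidance scales. A minimal numpy sketch follows; the exact variant and parameters LongCat-AudioDiT uses are assumptions.

```python
import numpy as np

def apg_guidance(cond, uncond, scale=4.0, eta=0.0):
    """Adaptive projected guidance (APG), minimal sketch.

    cond / uncond: conditional and unconditional model predictions.
    scale: guidance strength, as in classifier-free guidance (CFG).
    eta: weight on the parallel component; eta=1 recovers plain CFG,
         eta=0 keeps only the component orthogonal to `cond`.
    """
    diff = cond - uncond
    # Project the guidance vector onto the conditional prediction.
    denom = np.sum(cond * cond) + 1e-8
    parallel = (np.sum(diff * cond) / denom) * cond
    orthogonal = diff - parallel
    # Downweight the parallel (oversaturating) component.
    return cond + (scale - 1.0) * (orthogonal + eta * parallel)
```

With `eta=1.0` the function reduces exactly to the standard CFG update, which makes it easy to A/B the two guidance rules in the same sampler loop.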
// TAGS
speech · audio-gen · research · open-source · longcat-audiodit

DISCOVERED: 12d ago (2026-03-31)

PUBLISHED: 12d ago (2026-03-31)

RELEVANCE: 9/10

AUTHOR: fruesome