OPEN_SOURCE
REDDIT // MODEL RELEASE · 12d ago
LongCat-AudioDiT lands with waveform latent TTS
LongCat-AudioDiT is Meituan LongCat’s open diffusion TTS model that generates speech directly in waveform latent space instead of mel-spectrograms. The 3.5B variant claims SOTA zero-shot voice cloning on Seed, with weights and inference code released on GitHub and Hugging Face.
// ANALYSIS
This is a meaningful speech-model release because it attacks TTS complexity at the representation level, not just by scaling up another pipeline.
- Direct waveform-latent generation removes the mel bottleneck and should reduce compounding errors in long-form synthesis.
- The APG guidance swap and training-inference mismatch fix look like the real quality wins, not just parameter count.
- The 3.5B model's Seed gains are modest but credible, especially for zero-shot cloning where speaker similarity matters a lot.
- The paper's Wav-VAE finding is important: better reconstructions do not automatically translate to better end-to-end TTS.
- Open weights plus a Hugging Face-compatible implementation make this immediately useful for downstream speech tooling and fine-tuning experiments.
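To make the pipeline difference concrete, here is a toy sketch of the waveform-latent approach described above: a diffusion model denoises latents that a waveform VAE decodes straight to audio, with no mel-spectrogram + vocoder stage in between. Everything here is illustrative; `denoiser`, `vae_decode`, and `sample_waveform` are hypothetical stand-ins, not LongCat-AudioDiT's actual API.

```python
import numpy as np

# Hypothetical stand-ins for the trained networks (illustration only).
def denoiser(z, t, text_cond):
    # A trained DiT would predict the noise in latent z at step t,
    # conditioned on the input text; here we just shrink toward zero.
    return 0.1 * z

def vae_decode(z):
    # A trained waveform-VAE decoder maps latents directly to audio
    # samples (no mel stage); here we fake it by 4x frame upsampling.
    return np.repeat(z, 4, axis=-1)

def sample_waveform(text_cond, latent_shape=(64, 128), steps=10, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(latent_shape)   # start from pure noise
    for t in range(steps, 0, -1):           # iterative denoising loop
        z = z - denoiser(z, t, text_cond)
    return vae_decode(z)                    # latents -> waveform, one hop

audio = sample_waveform(text_cond="hello world")
print(audio.shape)  # (64, 512)
```

The point of the single `vae_decode` hop at the end is the bullet above: errors no longer compound across a mel predictor and a separate vocoder, because there is only one learned decoder between latents and audio.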
// TAGS
speech · audio-gen · research · open-source · longcat-audiodit
DISCOVERED
12d ago
2026-03-31
PUBLISHED
12d ago
2026-03-31
RELEVANCE
9/10
AUTHOR
fruesome