YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

LongCat-AudioDiT lands with waveform latent TTS

// 58d ago MODEL RELEASE

LongCat-AudioDiT is Meituan LongCat’s open diffusion TTS model that generates speech directly in a waveform latent space rather than through mel-spectrograms. The 3.5B variant claims state-of-the-art zero-shot voice cloning on the Seed benchmark, with weights and inference code released on GitHub and Hugging Face.

// ANALYSIS

This is a meaningful speech-model release because it attacks TTS complexity at the representation level, not just by scaling up another pipeline.

  • Direct waveform-latent generation removes the mel bottleneck and should reduce compounding errors in long-form synthesis.
  • The APG guidance swap and training-inference mismatch fix look like the real quality wins, not just parameter count.
  • The 3.5B model’s Seed gains are modest but credible, especially for zero-shot cloning where speaker similarity matters a lot.
  • The paper’s Wav-VAE finding is important: better reconstructions do not automatically translate to better end-to-end TTS.
  • Open weights plus a Hugging Face-compatible implementation make this immediately useful for downstream speech tooling and fine-tuning experiments.
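
To make the guidance point above concrete, here is a minimal sketch of how a projection-style guidance rule differs from plain classifier-free guidance (CFG). It assumes "APG" refers to adaptive projected guidance, which this item does not spell out. The function names and the `eta` parameter are illustrative and are not taken from the LongCat-AudioDiT code. The sketch also leaves out the norm rescaling and momentum terms that fuller APG formulations use.

```python
# Illustrative sketch only: names and parameters are hypothetical, and this is
# not the LongCat-AudioDiT implementation. Vectors are plain Python lists.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cfg(cond, uncond, scale):
    """Plain classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def projected_guidance(cond, uncond, scale, eta=0.0):
    """Projection-style guidance (assumed reading of "APG").

    The guidance direction (cond - uncond) is split into a part parallel to
    the conditional prediction and a part orthogonal to it. The parallel part
    is scaled by eta; eta=1 recovers plain CFG, eta=0 drops it entirely.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    denom = dot(cond, cond)
    coef = dot(diff, cond) / denom if denom > 0 else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


if __name__ == "__main__":
    cond, uncond = [1.0, 0.0], [0.5, 0.5]
    print(cfg(cond, uncond, 3.0))
    print(projected_guidance(cond, uncond, 3.0, eta=0.0))
```

With `eta=1.0` the two functions agree. Lowering `eta` damps the component of the update that mostly rescales the conditional prediction, which is the part usually blamed for oversaturation at high guidance scales.
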
// TAGS
speech audio-gen research open-source longcat-audiodit

DISCOVERED

58d ago

2026-03-31

PUBLISHED

58d ago

2026-03-31

RELEVANCE

9/10

AUTHOR

fruesome