LongCat-AudioDiT lands with waveform latent TTS
OPEN_SOURCE
REDDIT // 12d ago · MODEL RELEASE


LongCat-AudioDiT is Meituan LongCat’s open diffusion TTS model that generates speech directly in waveform latent space rather than through mel-spectrograms. The 3.5B variant claims state-of-the-art zero-shot voice cloning on the Seed benchmark, with weights and inference code released on GitHub and Hugging Face.
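To make the pipeline shape concrete, here is a toy numpy skeleton of the waveform-latent approach described above: a Wav-VAE compresses raw audio into latent frames, a diffusion model denoises those latents under text conditioning, and the VAE decoder returns a waveform with no mel-spectrogram stage in between. All names (`WavVAE`, `synthesize`) and shapes are illustrative stand-ins, not LongCat’s actual API.

```python
import numpy as np

class WavVAE:
    """Toy stand-in for a waveform VAE: maps audio <-> latent frames.

    Hypothetical hop size and latent dim; the real model's values
    are not stated in the summary above.
    """
    def __init__(self, hop=320, dim=64):
        self.hop, self.dim = hop, dim

    def encode(self, wav):
        # One latent frame per `hop` samples (placeholder latents).
        n = len(wav) // self.hop
        return np.random.randn(n, self.dim)

    def decode(self, z):
        # Placeholder decoder: returns a silent waveform of the
        # right length, one hop of samples per latent frame.
        return np.zeros(z.shape[0] * self.hop)

def synthesize(text_cond, vae, frames=100, steps=10):
    """Sketch of diffusion sampling in waveform-latent space.

    `text_cond` would condition the denoiser in a real model;
    here the denoising step is a trivial stand-in.
    """
    z = np.random.randn(frames, vae.dim)  # start from pure noise
    for _ in range(steps):
        z = z - 0.1 * z  # stand-in for one conditioned denoising step
    return vae.decode(z)  # waveform out, no mel intermediate
```

The point of the sketch is the data flow: text conditioning and the diffusion loop operate on VAE latents of the waveform itself, so the vocoder-style mel-to-audio stage disappears.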

// ANALYSIS

This is a meaningful speech-model release because it attacks TTS complexity at the representation level, not just by scaling up another pipeline.

  • Direct waveform-latent generation removes the mel bottleneck and should reduce compounding errors in long-form synthesis.
  • The APG guidance swap and training-inference mismatch fix look like the real quality wins, not just parameter count.
  • The 3.5B model’s Seed gains are modest but credible, especially for zero-shot cloning where speaker similarity matters a lot.
  • The paper’s Wav-VAE finding is important: better reconstructions do not automatically translate to better end-to-end TTS.
  • Open weights plus a Hugging Face-compatible implementation make this immediately useful for downstream speech tooling and fine-tuning experiments.
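On the guidance point above: assuming APG here means adaptive projected guidance, the idea is to replace classifier-free guidance’s raw update, `cond + (scale - 1) * (cond - uncond)`, with one that splits the guidance vector into components parallel and orthogonal to the conditional prediction and downweights the parallel part, which is associated with oversaturation at high guidance scales. A minimal numpy sketch follows; the exact variant and parameters LongCat-AudioDiT uses are assumptions.

```python
import numpy as np

def apg_guidance(cond, uncond, scale=4.0, eta=0.0):
    """Adaptive projected guidance (APG), minimal sketch.

    cond / uncond: conditional and unconditional model predictions.
    scale: guidance strength, as in classifier-free guidance (CFG).
    eta: weight on the parallel component; eta=1 recovers plain CFG,
         eta=0 keeps only the component orthogonal to `cond`.
    """
    diff = cond - uncond
    # Project the guidance vector onto the conditional prediction.
    denom = np.sum(cond * cond) + 1e-8
    parallel = (np.sum(diff * cond) / denom) * cond
    orthogonal = diff - parallel
    # Downweight the parallel (oversaturating) component.
    return cond + (scale - 1.0) * (orthogonal + eta * parallel)
```

With `eta=1.0` the function reduces exactly to the standard CFG update, which makes it easy to A/B the two guidance rules in the same sampler loop.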
// TAGS
speech · audio-gen · research · open-source · longcat-audiodit

DISCOVERED: 12d ago (2026-03-31)

PUBLISHED: 12d ago (2026-03-31)

RELEVANCE: 9/10

AUTHOR: fruesome