REDDIT // 19d ago // OPENSOURCE RELEASE
CosyVoice 3 setup woes, voice cloning drops words
CosyVoice 3 is pitched as a zero-shot multilingual TTS model with low-latency streaming, but a Reddit user says the local install is still brittle and the generated speech can skip or reorder words. The thread captures the gap between a strong demo and a clean local voice-cloning workflow.
// ANALYSIS
CosyVoice 3 looks powerful, but it still behaves like research code wrapped in a product pitch. The hard part is less installing the model and more matching the exact variant, prompt format, and serving path.
- The repo specifically recommends the `Fun-CosyVoice3-0.5B` checkpoint for better performance, but the install path still starts with `git clone --recursive`, a Python 3.10 conda env, and optional `ttsfrd` normalization, which is a lot for newcomers ([CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice)).
- The Hugging Face model card uses an assistant-style prefix plus an explicit end-of-prompt token, so the published examples are more structured than a plain text-in, audio-out call ([model card](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)).
- The repo issue tracker has a report of missing words and word-order drift in `inference_zero_shot`, which lines up closely with the failure mode described in the Reddit post ([issue #1302](https://github.com/FunAudioLLM/CosyVoice/issues/1302)).
- CosyVoice 3 does offer vLLM and TensorRT-LLM deployment paths, but that reinforces the point: the speed story is a serving problem as much as a model problem ([CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice)).
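
One way to make the dropped-word and word-order complaint measurable is to run the generated audio through any speech recognizer and diff the transcript against the input text. The sketch below covers only the diff step and uses Python's standard `difflib`. It does not call CosyVoice, and the helper names (`tokenize`, `compare_transcript`) are hypothetical, not part of the repo. It also assumes English-style word tokens, so it would need a different tokenizer for languages such as Chinese or Japanese.

```python
import difflib
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens (ASCII only)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compare_transcript(expected: str, heard: str) -> dict:
    """Diff the text sent to the TTS model against an ASR transcript of its output.

    `heard` is assumed to come from whatever speech recognizer you already use.
    """
    exp, got = tokenize(expected), tokenize(heard)
    matcher = difflib.SequenceMatcher(a=exp, b=got, autojunk=False)

    missing, extra = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(exp[i1:i2])  # in the input text, not in the audio
        if tag in ("insert", "replace"):
            extra.extend(got[j1:j2])    # in the audio, not in the input text

    # A word that is both "missing" and "extra" was most likely moved, not dropped.
    moved = sorted(set(missing) & set(extra))
    return {
        "missing": [w for w in missing if w not in moved],
        "extra": [w for w in extra if w not in moved],
        "moved": moved,
        "ratio": matcher.ratio(),  # 1.0 means the token sequences are identical
    }


if __name__ == "__main__":
    report = compare_transcript(
        "The quick brown fox jumps over the lazy dog.",
        "The quick fox jumps over the lazy dog.",
    )
    print(report)
```

Running a check like this over a batch of prompts gives a rough skip rate per checkpoint or prompt format, which is easier to compare across setups than listening by ear. ASR errors will show up as false positives, so the numbers are only a relative signal.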
// TAGS
speech · audio-gen · llm · open-source · self-hosted · inference · cosyvoice-3
DISCOVERED
19d ago
2026-03-24
PUBLISHED
19d ago
2026-03-23
RELEVANCE
8/10
AUTHOR
SciData777