CosyVoice 3 setup woes, voice cloning drops words
REDDIT · 19d ago · OPEN SOURCE RELEASE


CosyVoice 3 is pitched as a zero-shot multilingual TTS model with low-latency streaming, but a Reddit user says the local install is still brittle and the generated speech can skip or reorder words. The thread captures the gap between a strong demo and a clean local voice-cloning workflow.

// ANALYSIS

CosyVoice 3 looks powerful, but it still behaves like research code wrapped in a product pitch. The hard part is not installing the model so much as matching the exact checkpoint variant, prompt format, and serving path.

  • The repo specifically recommends the `Fun-CosyVoice3-0.5B` checkpoint for better performance, but the install path still starts with `git clone --recursive`, a Python 3.10 conda env, and optional `ttsfrd` normalization, which is a lot for newcomers ([CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice)).
  • The Hugging Face model card uses an assistant-style prefix plus an explicit end-of-prompt token, so the published examples are more structured than a plain text-in, audio-out call ([model card](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)).
  • The repo issue tracker has a report of missing words and word-order drift in `inference_zero_shot`, which lines up closely with the failure mode described in the Reddit post ([issue #1302](https://github.com/FunAudioLLM/CosyVoice/issues/1302)).
  • CosyVoice 3 does offer vLLM and TensorRT-LLM deployment paths, but that reinforces the point: the speed story is a serving problem as much as a model problem ([CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice)).
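For orientation, the install path from the first bullet condenses to roughly the following. This is a sketch based on the steps the repo README describes (recursive clone, Python 3.10 conda environment, pip requirements); exact package pins, the model download step, and the optional `ttsfrd` normalization install live in the repo and may change.

```shell
# Clone with submodules -- the repo vendors third-party code via git submodules.
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# The project targets Python 3.10 in a dedicated conda environment.
conda create -n cosyvoice python=3.10 -y
conda activate cosyvoice

# Install pinned dependencies; this is the step that most often breaks
# on mismatched CUDA / torch versions.
pip install -r requirements.txt
```

The optional `ttsfrd` text-normalization package is a separate, platform-sensitive install, which is one reason newcomers hit friction before ever loading a checkpoint.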
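Given the dropped-word reports in issue #1302, one pragmatic workaround is to verify each generation rather than trust it: transcribe the output with any ASR model (e.g. a Whisper variant; not something CosyVoice provides) and diff the transcript against the requested text. A minimal sketch of that check, with the ASR step assumed to happen elsewhere:

```python
import re

def word_coverage(requested_text: str, transcript: str) -> tuple[float, list[str]]:
    """Compare the text sent to TTS against an ASR transcript of the
    generated audio. Returns the fraction of requested words present in
    the transcript and the list of words that went missing."""
    def norm(s: str) -> list[str]:
        # Lowercase and keep only word characters so punctuation and
        # casing differences between TTS input and ASR output don't count.
        return re.findall(r"[a-z0-9']+", s.lower())

    want = norm(requested_text)
    remaining = norm(transcript)  # consumed as matches are found
    missing = []
    for w in want:
        if w in remaining:
            remaining.remove(w)  # multiset match: repeated words tracked
        else:
            missing.append(w)
    covered = (len(want) - len(missing)) / max(len(want), 1)
    return covered, missing

# Example: a transcript that skipped one word.
covered, missing = word_coverage("the quick brown fox", "the brown fox")
# covered == 0.75, missing == ["quick"]
```

A generation whose coverage falls below some threshold (say 0.95) can simply be retried, which turns an intermittent model bug into a bounded latency cost.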
// TAGS
speech · audio-gen · llm · open-source · self-hosted · inference · cosyvoice-3

DISCOVERED

19d ago

2026-03-24

PUBLISHED

19d ago

2026-03-23

RELEVANCE

8 / 10

AUTHOR

SciData777