OPEN_SOURCE
REDDIT · BENCHMARK RESULT
Batched Qwen3-TTS server tops 16.59x real-time
concurrent-faster-qwen3-server is an open-source Rust TTS server for Qwen3-TTS-12Hz-0.6B-Base that focuses on high-throughput concurrent inference. The project turns an upstream single-stream engine into a batched serving stack with voice cloning, streaming output, adaptive batching, and OOM recovery. According to the repo, it reaches 16.59x real-time at batch 16 on a single NVIDIA L4, with a claim of 18x improvement over the upstream engine in the headline benchmark path.
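The repo summary names adaptive batching as a core feature but does not show its internals. The sketch below illustrates the generic pattern such a server would use: block for the first request, then wait a bounded time for stragglers up to a maximum batch size. All names (`TtsRequest`, `collect_batch`, the limits) are illustrative, not taken from the project.

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Hypothetical request type; the real server's types are not part of the summary.
#[allow(dead_code)]
struct TtsRequest {
    id: usize,
}

// Collect up to `max_batch` requests, waiting at most `max_wait` for stragglers.
// Generic adaptive-batching pattern: latency is bounded by `max_wait`,
// throughput improves as the batch fills under load.
fn collect_batch(
    rx: &mpsc::Receiver<TtsRequest>,
    max_batch: usize,
    max_wait: Duration,
) -> Vec<TtsRequest> {
    let mut batch = Vec::new();
    // Block for the first request so an idle server doesn't busy-wait.
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // all senders gone
    }
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            Ok(req) => batch.push(req),
            Err(_) => break, // timeout or disconnect: serve what we have
        }
    }
    batch
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for id in 0..20 {
        tx.send(TtsRequest { id }).unwrap();
    }
    drop(tx);
    // First call fills to the cap; second call drains the remainder.
    println!("batch 1 = {}", collect_batch(&rx, 16, Duration::from_millis(50)).len());
    println!("batch 2 = {}", collect_batch(&rx, 16, Duration::from_millis(50)).len());
}
```

Under load the first batch fills to the cap of 16 immediately; the leftover four requests form the next batch without waiting out the full timeout, since the closed channel ends collection early.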
// ANALYSIS
Strong infra win with a clear production angle: the interesting part is not raw model quality, but making a sequential TTS stack behave like a multi-tenant service on one GPU.
- The core improvement is batching across the full generation path: autoregressive decoding, vocoder decoding, and streaming workers.
- The repo frames this as a call-center-style workload, which makes the latency-throughput tradeoff more credible than a toy benchmark.
- The L4 numbers are practical: low idle VRAM, 450 ms time-to-first-audio, and 60-80 simultaneous calls in the stated scenario.
- Caveat: the benchmark is repo-authored, so treat it as a strong vendor-style claim until it is reproduced independently.
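The 16.59x real-time figure and the 60-80 simultaneous-calls figure can be reconciled with a back-of-envelope check, if one assumes calls only need synthesis for a fraction of their duration (the duty cycle below is our assumption, not a number from the repo):

```rust
fn main() {
    let rtf = 16.59_f64; // real-time factor at batch 16 (from the repo benchmark)

    // Assumed: in a call-center flow, TTS is active only part of the time
    // (the rest is the caller speaking, ASR, or silence). 25% is illustrative.
    let tts_duty_cycle = 0.25;

    // Audio-seconds generated per wall-clock second, spread over intermittent calls.
    let supported_calls = rtf / tts_duty_cycle;
    println!("supported calls ≈ {:.0}", supported_calls);
}
```

At a 25% duty cycle this lands around 66 concurrent calls, inside the repo's stated 60-80 range; a lower duty cycle would push the figure higher, so the claim is arithmetically plausible.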
// TAGS
tts · inference · batching · rust · qwen3 · nvidia-l4 · voice-cloning · streaming · open-source · gpu-serving
DISCOVERED
2026-04-03
PUBLISHED
2026-04-03
RELEVANCE
9/10
AUTHOR
alfonsodlg