Vexa weighs Parakeet, Voxtral for live transcripts
Vexa, the open-source meeting transcription API for Google Meet, Microsoft Teams, and Zoom, is asking for production feedback as it benchmarks Parakeet-TDT, Voxtral Mini, and VibeVoice against Whisper large-v3-turbo for real-time meeting transcription. The team is focused on streaming behavior, multilingual accuracy, operational surprises, GPU footprint, and whether CTC/transducer models really eliminate silence hallucinations in production.
This is the kind of infrastructure question that matters more than leaderboard wins: Vexa is testing where speech models break in real deployments, not just which one tops a benchmark. It also shows how quickly teams shipping Whisper into production are running into edge cases that push them toward streaming-first ASR alternatives.
- Vexa already runs sub-second transcript streaming over WebSockets, so the evaluation is about end-to-end behavior under production constraints rather than toy demos
- NVIDIA positions Parakeet v2 as a high-speed, high-accuracy ASR model, but its English-first framing leaves multilingual coverage as a real risk for Vexa's Croatian, Latvian, Finnish, and French users
- Mistral markets Voxtral as outperforming Whisper large-v3 on speech tasks, but Vexa is explicitly asking for latency, memory, and failure-mode data that vendor benchmarks rarely show
- The silence-hallucination angle is the sharpest part of the post: if CTC/transducer models really avoid Whisper's dead-air failure mode, that is a meaningful operational win for live meeting products
- Because Vexa supports both self-hosters on consumer GPUs and larger cluster deployments, model size alone is not enough; runtime characteristics, batching behavior, and degradation under load will decide what actually ships
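The silence-hallucination point deserves unpacking: Whisper-style decoder models can invent text when fed dead air, which is common in meetings. A mitigation that works regardless of which model wins the benchmark is to gate near-silent audio out of the pipeline before inference. The sketch below is a minimal, hypothetical illustration of that idea using an RMS energy threshold over 16-bit PCM chunks; it is not Vexa's actual pipeline, and the threshold value is an assumption that would need tuning per deployment.

```python
import math
import struct

# Tuning assumption for 16-bit PCM; a real deployment would calibrate
# this (or use a proper VAD model) rather than hard-code it.
SILENCE_RMS_THRESHOLD = 200


def chunk_rms(pcm_bytes: bytes) -> float:
    """Root-mean-square energy of a chunk of 16-bit little-endian PCM."""
    if not pcm_bytes:
        return 0.0
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gate_silence(chunks):
    """Yield only chunks loud enough to plausibly contain speech.

    Decoder-based ASR models can hallucinate text on dead air; dropping
    near-silent chunks before inference gives them nothing to
    hallucinate on, independent of model architecture.
    """
    for chunk in chunks:
        if chunk_rms(chunk) >= SILENCE_RMS_THRESHOLD:
            yield chunk


# Synthetic demo: one near-silent chunk, one loud chunk.
quiet = struct.pack("<4h", 3, -2, 1, -1)
loud = struct.pack("<4h", 8000, -7500, 9000, -8200)
kept = list(gate_silence([quiet, loud]))
print(len(kept))  # only the loud chunk survives
```

The interesting question Vexa raises is whether CTC/transducer models make this kind of external gating unnecessary, since their frame-synchronous decoding has no autoregressive loop to run away on silence.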
DISCOVERED: 2026-03-08
PUBLISHED: 2026-03-08
AUTHOR: Aggravating-Gap7783