OPEN_SOURCE
REDDIT · INFRASTRUCTURE
VibeVoice ASR hits 24GB vLLM wall
A LocalLLaMA user says Microsoft's VibeVoice ASR works in Transformers but still blows past 24GB VRAM in vLLM. They're asking whether anyone has a stable single-GPU recipe for the 9B, long-form speech-to-text model.
// ANALYSIS
The interesting part is that this looks less like a vLLM bug than a reminder that “supported” and “comfortable” are different things. Microsoft now ships an official vLLM deployment guide, but it immediately reaches for memory tuning and multi-GPU options, which is a clue that 24GB is tight for this workload.
- The model handles 60-minute, 64K-token ASR, so KV-cache pressure is likely as painful as the 9B weights themselves.
- The official guide suggests knobs like `--gpu-memory-utilization`, `--max-num-seqs`, `--max-model-len`, and `PYTORCH_ALLOC_CONF=expandable_segments:True`.
- Microsoft documents tensor-parallel and data-parallel setups, which reads like an admission that single-card deployment is not the happy path.
- If your goal is just local transcription, Transformers may be the simpler fit; vLLM mainly buys you serving throughput and API compatibility.
- VibeVoice ASR is doing diarization and hotword-aware transcription too, so squeezing memory often means accepting tradeoffs elsewhere.
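To see why a 64K-token context strains a 24GB card even after the ~18GB of bf16 weights, a back-of-envelope KV-cache calculation helps. The architecture numbers below (layer count, KV heads, head dimension) are assumptions for a generic ~9B decoder, not VibeVoice ASR's published config; the point is the scaling, not the exact figure.

```python
def kv_cache_bytes(tokens: int, layers: int = 36, kv_heads: int = 4,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size for one sequence.

    2x accounts for the separate K and V tensors; all other parameters
    are ASSUMED values for a hypothetical ~9B decoder with grouped-query
    attention, stored in fp16/bf16 (2 bytes per element).
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

ctx = 64 * 1024  # the 64K-token long-form window mentioned above

gqa_gib = kv_cache_bytes(ctx) / 2**30
print(f"KV cache at {ctx} tokens (GQA, 4 KV heads): {gqa_gib:.1f} GiB")

# If the model used full multi-head attention (KV heads == 32 query
# heads), the same context would cost 8x more on its own:
mha_gib = kv_cache_bytes(ctx, kv_heads=32) / 2**30
print(f"Same context with full MHA: {mha_gib:.1f} GiB")
```

Under these assumed shapes, one 64K sequence costs roughly 4.5 GiB of cache with aggressive GQA and ~36 GiB without it, which is why vLLM's `--max-model-len` and `--max-num-seqs` knobs matter so much here: they cap exactly this allocation.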
// TAGS
speech, inference, gpu, llm, open-source, vibevoice-asr
DISCOVERED
2026-03-28
PUBLISHED
2026-03-28
RELEVANCE
8 / 10
AUTHOR
GotHereLateNameTaken