VibeVoice ASR hits 24GB vLLM wall
OPEN_SOURCE
REDDIT // 14d ago · INFRASTRUCTURE

A LocalLLaMA user reports that Microsoft's VibeVoice ASR runs under Transformers but still blows past 24GB of VRAM in vLLM. They are asking whether anyone has a stable single-GPU recipe for the 9B long-form speech-to-text model.

// ANALYSIS

The interesting part is that this looks less like a vLLM bug than a reminder that “supported” and “comfortable” are different things. Microsoft now ships an official vLLM deployment guide, but it immediately reaches for memory tuning and multi-GPU options, which is a clue that 24GB is tight for this workload.

  • The model handles 60-minute, 64K-token ASR, so KV-cache pressure is likely as painful as the 9B weights themselves.
  • The official guide suggests knobs like `--gpu-memory-utilization`, `--max-num-seqs`, `--max-model-len`, and `PYTORCH_ALLOC_CONF=expandable_segments:True`.
  • Microsoft documents tensor parallel and data parallel setups, which reads like an admission that single-card deployment is not the happy path.
  • If your goal is just local transcription, Transformers may be the simpler fit; vLLM mainly buys you serving throughput and API compatibility.
  • VibeVoice ASR is doing diarization and hotword-aware transcription too, so squeezing memory often means accepting tradeoffs elsewhere.
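Pulled together, the knobs from the official guide combine into a single-GPU launch along these lines. This is a sketch, not a verified 24GB recipe: the model id and the exact values are placeholders, and the right `--max-model-len` depends on how much of the 64K window your transcripts actually need.

```shell
# Hypothetical single-GPU launch; model id and values are illustrative only.
# expandable_segments:True reduces fragmentation in PyTorch's CUDA allocator.
# Capping --max-model-len below 64K shrinks the KV cache, and
# --max-num-seqs 1 trades serving throughput for memory headroom.
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve microsoft/VibeVoice-ASR \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 1
```

If even this does not fit, the remaining levers in the guide are the multi-GPU ones (tensor or data parallel), which is exactly the point the analysis above is making.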
// TAGS
speech · inference · gpu · llm · open-source · vibevoice-asr

DISCOVERED

2026-03-28 (14d ago)

PUBLISHED

2026-03-28 (14d ago)

RELEVANCE

8/10

AUTHOR

GotHereLateNameTaken