YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

VibeVoice ASR hits 24GB vLLM wall

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more. This page keeps the article you opened front and center while giving you a path into the live feed.

// WHAT AICRIER DOES

7+

TRACKED FEEDS

24/7

SCRAPED FEED

Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

VibeVoice ASR hits 24GB vLLM wall
OPEN LINK ↗
// 60d agoINFRASTRUCTURE

VibeVoice ASR hits 24GB vLLM wall

A LocalLLaMA user says Microsoft's VibeVoice ASR works in Transformers but still blows past 24GB VRAM in vLLM. They're asking whether anyone has a stable single-GPU recipe for the 9B, long-form speech-to-text model.

// ANALYSIS

The interesting part is that this looks less like a vLLM bug than a reminder that “supported” and “comfortable” are different things. Microsoft now ships an official vLLM deployment guide, but it immediately reaches for memory tuning and multi-GPU options, which is a clue that 24GB is tight for this workload.

  • The model handles 60-minute, 64K-token ASR, so KV-cache pressure is likely as painful as the 9B weights themselves.
  • The official guide suggests knobs like `--gpu-memory-utilization`, `--max-num-seqs`, `--max-model-len`, and `PYTORCH_ALLOC_CONF=expandable_segments:True`.
  • Microsoft documents tensor parallel and data parallel setups, which reads like an admission that single-card deployment is not the happy path.
  • If your goal is just local transcription, Transformers may be the simpler fit; vLLM mainly buys you serving throughput and API compatibility.
  • VibeVoice ASR is doing diarization and hotword-aware transcription too, so squeezing memory often means accepting tradeoffs elsewhere.
// TAGS
speechinferencegpullmopen-sourcevibevoice-asr

DISCOVERED

60d ago

2026-03-28

PUBLISHED

60d ago

2026-03-28

RELEVANCE

8/ 10

AUTHOR

GotHereLateNameTaken