OPEN_SOURCE
REDDIT · 37d ago · INFRASTRUCTURE
LLM Compressor users report multi-GPU AWQ OOMs
A Reddit post in r/LocalLLaMA reports that vLLM's LLM Compressor `oneshot` AWQ flow appears to collapse quantization onto a single GPU, triggering out-of-memory crashes even when the model initially loads across multiple GPUs. The linked docs note the example is single-process and point to separate distributed guidance, which likely explains the mismatch.
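The reported failure mode can be sketched with back-of-envelope VRAM arithmetic (all numbers here are illustrative, not taken from the post):

```python
# Illustrative VRAM math for why a model can load sharded across GPUs
# yet OOM if a later step concentrates work on one device.
params = 70e9          # hypothetical 70B-parameter model
bytes_per_param = 2    # fp16 weights
weights_gb = params * bytes_per_param / 1e9  # ~140 GB of weights

gpu_vram_gb = 48       # hypothetical per-GPU capacity
num_gpus = 4

# Sharded load across 4 GPUs: 192 GB total, the weights fit.
fits_sharded = weights_gb <= gpu_vram_gb * num_gpus

# If quantization collapses onto a single GPU, 140 GB > 48 GB: OOM.
fits_single = weights_gb <= gpu_vram_gb

print(fits_sharded, fits_single)
```

This is the shape of the complaint: the load path and the quantization path have different effective memory budgets.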
// ANALYSIS
This looks less like a one-off user mistake and more like a docs/UX gap around distributed quantization paths for large local setups.
- The complaint targets `oneshot()` behavior under AWQ when VRAM is near single-GPU limits.
- The referenced docs emphasize single-process quantization, which can surprise users expecting tensor-parallel-style behavior.
- AutoAWQ's deprecation adds migration friction for local inference users.
- This is practical infrastructure pain for developers running multi-GPU quant workflows, not just a beginner setup issue.
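For context, the docs-style single-process flow at issue looks roughly like the sketch below. This is an assumption-laden illustration, not the post's exact code: the recipe parameters and dataset name are placeholders, and exact `llmcompressor` signatures vary by version.

```python
# Hedged sketch of a single-process oneshot AWQ run (identifiers may
# differ across llmcompressor versions; parameters are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "..."  # placeholder for a large model near single-GPU limits

# device_map="auto" shards the weights across all visible GPUs at load
# time, which is why the model appears to fit initially.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# oneshot() runs as a single process. Per the linked docs it is not a
# tensor-parallel path, so calibration can concentrate memory pressure
# on one device and OOM despite the sharded load above.
oneshot(
    model=model,
    recipe=AWQModifier(targets="Linear", ignore=["lm_head"]),
    dataset="open_platypus",  # calibration set; name is illustrative
)
```

The gap the post surfaces is that nothing in this flow signals that quantization-time memory behavior differs from load-time sharding; the distributed guidance lives in a separate doc.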
// TAGS
llm-compressor · vllm · llm · inference · gpu · devtool
DISCOVERED
2026-03-05
PUBLISHED
2026-03-05
RELEVANCE
7/10
AUTHOR
siegevjorn