LLM Compressor users report multi-GPU AWQ OOMs
A Reddit post in r/LocalLLaMA says vLLM's LLM Compressor oneshot AWQ flow appears to collapse quantization onto a single GPU, triggering out-of-memory crashes even when models initially load across multiple GPUs. The linked docs note the example is single-process and point to separate distributed guidance, which likely explains the mismatch.
This looks less like a one-off user mistake and more like a docs/UX gap around distributed quantization paths for large local setups.
- The complaint targets `oneshot()` behavior under AWQ when a model's footprint is close to the VRAM limit of a single GPU.
- The referenced docs emphasize single-process quantization, which can surprise users expecting tensor-parallel-style behavior.
- Frustration over AutoAWQ's deprecation points to migration friction for local inference users.
- This is practical infrastructure pain for developers running multi-GPU quant workflows, not just a beginner setup issue.
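
To make the reported mismatch concrete, here is a rough back-of-the-envelope sketch of why a model can load when sharded across GPUs yet run out of memory if the work lands on one device. All numbers (a 32B-parameter model, fp16 weights, four 24 GiB GPUs) and the helper names `weights_gib` and `fits` are hypothetical and chosen only for illustration. The sketch counts weights only and ignores activations, calibration data, and framework overhead, and it does not model how LLM Compressor actually places tensors.

```python
# Toy arithmetic only: weights-only footprint, hypothetical numbers.
GIB = 1024 ** 3


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (fp16/bf16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / GIB


def fits(required_gib: float, gpu_gib: float, num_gpus: int = 1) -> bool:
    """True if the footprint, split evenly over num_gpus, fits on each GPU."""
    return required_gib / num_gpus <= gpu_gib


# Hypothetical setup: 32B parameters in fp16 on 4 x 24 GiB GPUs.
required = weights_gib(32e9)

print(f"weights: {required:.1f} GiB")
print("fits when sharded over 4 GPUs:", fits(required, gpu_gib=24, num_gpus=4))
print("fits on a single GPU:", fits(required, gpu_gib=24, num_gpus=1))
```

Under these assumptions the same weights fit comfortably when split four ways but exceed a single card several times over, which matches the shape of the complaint: a successful multi-GPU load does not guarantee that a single-process quantization pass has enough memory.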
DISCOVERED
2026-03-05
PUBLISHED
2026-03-05
AUTHOR
siegevjorn