OPEN_SOURCE
REDDIT // TUTORIAL
llama.cpp Vulkan split offload hits OOM on B580
A user with an Intel B580 GPU and 32GB RAM is trying to run Qwen3-80B (Q3_K_M) using llama.cpp's Vulkan backend, splitting layers between VRAM and system RAM. Multiple configuration attempts result in either a "device lost" error with `--fit` flags or an out-of-memory crash when trying to force CPU offloading.
// ANALYSIS
This is a common pain point with Vulkan backends — memory management is less mature than CUDA, and split GPU/CPU offloading is notoriously finicky on non-NVIDIA hardware.
- The `--fit` flag attempts auto-allocation but can overcommit VRAM on Intel Arc, triggering Vulkan device loss
- `--no-mmap` loads the full model into RAM before layer splitting, which is the right starting point but requires careful `-ngl` tuning to avoid VRAM overflow
- Explicit `-ngl <n>` gives exact control over the GPU layer count; combine it with `--no-mmap` and sufficient RAM headroom
- The Intel Arc B580 has 12GB VRAM; an 80B Q3_K_M model is ~35GB, so only a fraction of its layers fit on the GPU
- Vulkan support in llama.cpp continues to lag behind CUDA for advanced memory features such as unified addressing
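The bullets above reduce to a back-of-the-envelope calculation: divide the quantized model size evenly across layers, subtract some VRAM for the KV cache and compute buffers, and see how many layers remain. This is a rough sketch only; the layer count (48) and the 2 GiB reserve are illustrative assumptions, not measured values for Qwen3-80B or the B580:

```python
# Rough estimate of a safe -ngl value for llama.cpp split offload.
# Assumptions (illustrative): weight bytes are spread evenly across
# layers, and some VRAM must stay free for KV cache / compute buffers.

def max_gpu_layers(model_gib: float, n_layers: int,
                   vram_gib: float, reserve_gib: float = 2.0) -> int:
    """Conservative number of layers that fit in VRAM."""
    per_layer = model_gib / n_layers            # GiB per layer (approx.)
    budget = max(vram_gib - reserve_gib, 0.0)   # usable VRAM after reserve
    return int(budget // per_layer)

# B580: 12 GiB VRAM; Qwen3-80B Q3_K_M: ~35 GiB; assume 48 layers
print(max_gpu_layers(35.0, 48, 12.0))  # → 13
```

In practice, start a few layers below the estimate and raise `-ngl` until VRAM is nearly full, watching for the Vulkan device-lost error; the remaining layers stay in system RAM, which `--no-mmap` keeps resident rather than paged in from disk.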
// TAGS
llama-cpp · llm · inference · gpu · edge-ai · open-source
DISCOVERED
2026-03-15
PUBLISHED
2026-03-15
RELEVANCE
5/10
AUTHOR
WizardlyBump17