llama.cpp Vulkan split offload hits OOM on B580
REDDIT · 28d ago · TUTORIAL · OPEN_SOURCE


A user with an Intel B580 GPU and 32GB RAM is trying to run Qwen3-80B (Q3_K_M) using llama.cpp's Vulkan backend, splitting layers between VRAM and system RAM. Multiple configuration attempts result in either a "device lost" error with `--fit` flags or an out-of-memory crash when trying to force CPU offloading.

// ANALYSIS

This is a common pain point with Vulkan backends — memory management is less mature than CUDA, and split GPU/CPU offloading is notoriously finicky on non-NVIDIA hardware.

  • The `--fit` flag attempts auto-allocation but can overcommit VRAM on Intel Arc, triggering Vulkan device loss
  • `--no-mmap` forces the full model into RAM before layer splitting — the right direction, but it needs enough free RAM for the whole model
  • The reliable approach is an explicit `-ngl <n>` to pin the exact layer count on GPU, combined with `--no-mmap` and careful tuning so the GPU layers plus KV cache stay under the VRAM ceiling
  • Intel Arc B580 has 12GB VRAM; an 80B Q3_K_M model is ~35GB, so only a fraction of layers fit on GPU
  • Vulkan support in llama.cpp continues to lag behind CUDA for advanced memory features like unified addressing
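A rough sketch of the `-ngl` budgeting the bullets describe. The layer count (48) and VRAM reserve (~2 GB for KV cache and compute buffers) are assumptions for illustration, not figures from the post:

```python
# Estimate how many layers of a split-offload model fit on GPU,
# assuming roughly uniform per-layer size.
def gpu_layer_budget(model_gb: float, n_layers: int,
                     vram_gb: float, reserve_gb: float = 2.0) -> int:
    per_layer_gb = model_gb / n_layers       # uniform-size approximation
    usable_gb = vram_gb - reserve_gb         # leave headroom for KV cache/buffers
    return max(0, int(usable_gb // per_layer_gb))

# ~35 GB Q3_K_M model, hypothetical 48 layers, 12 GB B580:
ngl = gpu_layer_budget(model_gb=35, n_layers=48, vram_gb=12)
print(f"llama-cli ... --no-mmap -ngl {ngl}")
```

Under these assumptions only ~13 of 48 layers land on GPU — consistent with the bullet above that only a fraction of the model fits in 12GB VRAM; starting low and raising `-ngl` until just before VRAM overflow is the usual tuning loop.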
// TAGS
llama-cpp · llm · inference · gpu · edge-ai · open-source

DISCOVERED

2026-03-15 (28d ago)

PUBLISHED

2026-03-15 (28d ago)

RELEVANCE

5/10

AUTHOR

WizardlyBump17