llama.cpp offload experiment hits AMD limits
OPEN_SOURCE · REDDIT · 24d ago · INFRASTRUCTURE

A LocalLLaMA user asks whether llama.cpp can spread a 170GB Qwen3.5 397B model across AMD VRAM, system RAM, and SSD to squeeze out better throughput. The post captures the hard reality that memory-mapped loading and true fast offload are not the same thing, especially once disk becomes part of the inference path.

// ANALYSIS

llama.cpp can stretch surprisingly far across mixed hardware, but once SSD is doing real work during generation, you’re usually fighting bandwidth and page-fault latency instead of getting a clever “third tier” of memory. On AMD, the stack is viable, but the user’s 0.11 tok/s suggests they’ve already crossed into diminishing-return territory.
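In practice, this kind of mixed-tier setup comes down to a handful of llama.cpp flags. A minimal sketch of the invocation being discussed, assuming the HIP build of `llama-cli`; the model filename and the `-ngl` value are hypothetical and would need tuning to what actually fits in 48GB of VRAM:

```shell
# Hypothetical command sketch, not from the original post.
# -ngl: number of layers to offload to the GPU (tune until VRAM is full)
# default mmap: layers that fit in neither VRAM nor RAM are demand-paged
#               from SSD, which is what tanks tokens/s
# --mlock: pins loaded pages in RAM, but cannot make the SSD tier fast
llama-cli \
  -m qwen3.5-397b-q4.gguf \
  -ngl 40 \
  --mlock \
  -p "Hello"
```

Note that adding `--no-mmap` here would force the whole file through an explicit read at load time, which is exactly the "may fail to load at all" failure mode noted below when the model exceeds fast memory.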

  • Official llama.cpp docs list HIP support for AMD GPUs, so the platform itself is not the blocker.
  • GitHub guidance on memory usage notes that `mmap` is the default, while `--no-mmap` and `--mlock` change paging behavior rather than making SSD a fast inference tier.
  • If the model exceeds fast memory, disabling `mmap` can actually prevent it from loading at all, which is a blunt reminder that “fit” and “run well” are different problems.
  • A 170GB model on 48GB VRAM plus 64GB RAM is fundamentally constrained by storage and host-memory bandwidth, so token speed will crater before compute saturates.
  • For this class of workload, smaller quantization, a smaller model, more VRAM, or distributed inference are usually more realistic wins than disk offloading.
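The bandwidth argument above can be made concrete with a back-of-envelope model: if roughly every weight byte is touched once per token, the SSD-resident spill bounds throughput at (SSD read bandwidth) / (spill size). A minimal sketch, with hypothetical hardware numbers (the ~3.5 GB/s NVMe figure is an assumption, not from the post):

```python
def ssd_bound_tok_per_s(model_gb: float, vram_gb: float,
                        ram_gb: float, ssd_gb_per_s: float) -> float:
    """Upper bound on tokens/s when the SSD-resident spill is the bottleneck.

    Assumes every weight byte is read once per token and that bytes
    resident in VRAM or RAM are fast relative to the SSD remainder.
    """
    spill_gb = max(0.0, model_gb - vram_gb - ram_gb)
    if spill_gb == 0.0:
        return float("inf")  # nothing spills to disk; SSD is not the limit
    return ssd_gb_per_s / spill_gb

# 170GB model, 48GB VRAM, 64GB RAM, ~3.5 GB/s sequential NVMe read (assumed):
# 58GB must stream from SSD per token.
print(round(ssd_bound_tok_per_s(170, 48, 64, 3.5), 3))  # → 0.06
```

Even this optimistic bound lands in the same sub-0.1 tok/s territory the user reported, which is why shrinking the spill (smaller quant, smaller model, more VRAM) beats tuning the disk path.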
// TAGS
llama-cpp · llm · inference · gpu · open-source · self-hosted

DISCOVERED

2026-03-19 (24d ago)

PUBLISHED

2026-03-19 (24d ago)

RELEVANCE

7/10

AUTHOR

EmPips