OPEN_SOURCE
REDDIT // INFRASTRUCTURE · 24d ago
llama.cpp offload experiment hits AMD limits
A LocalLLaMA user asks whether llama.cpp can spread a 170GB Qwen3.5 397B model across AMD VRAM, system RAM, and SSD to squeeze out better throughput. The post captures the hard reality that memory-mapped loading and true fast offload are not the same thing, especially once disk becomes part of the inference path.
// ANALYSIS
llama.cpp can stretch surprisingly far across mixed hardware, but once SSD is doing real work during generation, you’re usually fighting bandwidth and page-fault latency instead of getting a clever “third tier” of memory. On AMD, the stack is viable, but the user’s 0.11 tok/s suggests they’ve already crossed into diminishing-return territory.
- Official llama.cpp docs list HIP support for AMD GPUs, so the platform itself is not the blocker.
- GitHub guidance on memory usage notes that `mmap` is the default, while `--no-mmap` and `--mlock` change paging behavior rather than making SSD a fast inference tier.
- If the model exceeds fast memory, disabling `mmap` can actually prevent it from loading at all, a blunt reminder that “fit” and “run well” are different problems.
- A 170GB model on 48GB of VRAM plus 64GB of RAM is fundamentally constrained by storage and host-memory bandwidth, so token speed craters before compute saturates.
- For this class of workload, smaller quantization, a smaller model, more VRAM, or distributed inference are usually more realistic wins than disk offloading.
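The bandwidth constraint above can be sketched with a back-of-envelope estimate. This is a rough model, not a benchmark: it assumes a dense model whose non-resident weights must be re-read from SSD on every decode step, and it ignores compute time, caching, and activation traffic. The 170GB / 48GB / 64GB figures come from the post; the 3 GB/s SSD read bandwidth is an assumed NVMe number.

```python
# Rough lower-bound estimate of token rate when weights spill to SSD.
# Assumption: dense model, every non-resident byte re-read per token;
# ignores compute, KV-cache traffic, and any page-cache reuse.

MODEL_GB = 170            # quantized model size from the post
VRAM_GB = 48              # user's AMD VRAM (from the post)
RAM_GB = 64               # system RAM (from the post)
SSD_GBPS = 3.0            # assumed NVMe sequential read bandwidth

resident = min(MODEL_GB, VRAM_GB + RAM_GB)   # weights held in fast memory
streamed = MODEL_GB - resident               # GB re-read from disk per token

# Token rate is bounded by how fast the streamed portion can be read back.
tok_per_s = SSD_GBPS / streamed if streamed > 0 else float("inf")
print(f"~{streamed} GB from SSD per token -> ~{tok_per_s:.3f} tok/s ceiling")
```

Under these assumptions roughly 58GB must stream from disk per token, giving a ceiling around 0.05 tok/s, which is the same order of magnitude as the 0.11 tok/s the user reported (page-cache reuse and sparse access patterns can push the real number somewhat higher).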
// TAGS
llama-cpp · llm · inference · gpu · open-source · self-hosted
DISCOVERED
2026-03-19
PUBLISHED
2026-03-19
RELEVANCE
7 / 10
AUTHOR
EmPips