llama.cpp offload experiment hits AMD limits
A LocalLLaMA user asks whether llama.cpp can spread a 170GB Qwen3.5 397B model across AMD VRAM, system RAM, and SSD to squeeze out better throughput. The post captures the hard reality that memory-mapped loading and true fast offload are not the same thing, especially once disk becomes part of the inference path.
llama.cpp can stretch surprisingly far across mixed hardware, but once SSD is doing real work during generation, you’re usually fighting bandwidth and page-fault latency instead of getting a clever “third tier” of memory. On AMD, the stack is viable, but the user’s 0.11 tok/s suggests they’ve already crossed into diminishing-return territory.
- Official llama.cpp docs list HIP support for AMD GPUs, so the platform itself is not the blocker.
- GitHub guidance on memory usage notes that `mmap` is the default, while `--no-mmap` and `--mlock` change paging behavior rather than making SSD a fast inference tier.
- If the model exceeds fast memory, disabling `mmap` can actually prevent it from loading at all, which is a blunt reminder that “fit” and “run well” are different problems.
- A 170GB model on 48GB VRAM plus 64GB RAM is fundamentally constrained by storage and host-memory bandwidth, so token speed will crater before compute saturates.
- For this class of workload, smaller quantization, a smaller model, more VRAM, or distributed inference are usually more realistic wins than disk offloading.
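
The bandwidth argument in the takeaways above can be made concrete with a rough back-of-envelope sketch. The model, VRAM, and RAM sizes come from the post; the SSD read bandwidth (`ASSUMED_SSD_BYTES_PER_S`) is a hypothetical figure chosen for illustration, and the "every spilled byte is re-read for each token" case is a deliberately pessimistic assumption. It also assumes all VRAM and RAM is available for weights, which ignores the KV cache and the OS, and it does not model how many weights the architecture actually touches per token.

```python
GB = 1e9  # decimal gigabytes, matching how the sizes are quoted in the post

def spill_bytes(model_bytes, vram_bytes, ram_bytes):
    """Bytes that fit in neither VRAM nor RAM and must be paged in from disk."""
    return max(0.0, model_bytes - vram_bytes - ram_bytes)

def tokens_per_second_ceiling(disk_bytes_per_token, disk_bytes_per_s):
    """Upper bound on decode speed if disk reads were the only cost."""
    if disk_bytes_per_token == 0:
        return float("inf")
    return disk_bytes_per_s / disk_bytes_per_token

# Sizes quoted in the post.
model = 170 * GB
vram = 48 * GB
ram = 64 * GB

# Hypothetical sequential-read bandwidth; real page-fault-driven reads
# through mmap are typically slower than a drive's headline number.
ASSUMED_SSD_BYTES_PER_S = 3 * GB

spill = spill_bytes(model, vram, ram)  # 58 GB left on disk

# Pessimistic case: every spilled byte is re-read for each generated token.
worst_case = tokens_per_second_ceiling(spill, ASSUMED_SSD_BYTES_PER_S)

print(f"spill: {spill / GB:.0f} GB")
print(f"disk-bound ceiling (worst case): {worst_case:.3f} tok/s")
```

Under these assumptions the ceiling lands in the same order of magnitude as the reported 0.11 tok/s, which is consistent with disk reads dominating generation time. Changing the assumed bandwidth or the per-token read volume moves the number, but not the conclusion that the spilled portion sets the pace.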
DISCOVERED
70d ago
2026-03-19
PUBLISHED
70d ago
2026-03-19
AUTHOR
EmPips