llama.cpp offload experiment hits AMD limits

// 70d ago INFRASTRUCTURE

A LocalLLaMA user asks whether llama.cpp can spread a 170GB Qwen3.5 397B model across AMD VRAM, system RAM, and SSD to squeeze out better throughput. The post captures the hard reality that memory-mapped loading and true fast offload are not the same thing, especially once disk becomes part of the inference path.

// ANALYSIS

llama.cpp can stretch surprisingly far across mixed hardware, but once SSD is doing real work during generation, you’re usually fighting bandwidth and page-fault latency instead of getting a clever “third tier” of memory. On AMD, the stack is viable, but the user’s 0.11 tok/s suggests they’ve already crossed into diminishing-return territory.
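The bandwidth argument can be made concrete with a rough bound: if each generated token requires streaming some amount of weight data from a memory tier, throughput cannot exceed that tier's bandwidth divided by the bytes read per token. The sketch below is back-of-envelope arithmetic, not a llama.cpp measurement. The bandwidth figures and the bytes-per-token value are illustrative assumptions, not numbers from the post, and the function name is hypothetical.

```python
GB = 1e9


def max_tokens_per_second(bytes_per_token: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when every token must stream
    `bytes_per_token` of weights from a tier with the given bandwidth."""
    if bytes_per_token <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_bytes_per_s / bytes_per_token


# Assumed, order-of-magnitude bandwidths (not measured on the poster's machine).
ASSUMED_BANDWIDTH = {
    "vram": 800 * GB,
    "system_ram": 60 * GB,
    "nvme_ssd": 5 * GB,
}

# Assumed: 20 GB of weights touched per generated token (illustrative only).
ASSUMED_BYTES_PER_TOKEN = 20 * GB

for tier, bandwidth in ASSUMED_BANDWIDTH.items():
    bound = max_tokens_per_second(ASSUMED_BYTES_PER_TOKEN, bandwidth)
    print(f"{tier}: at most {bound:.2f} tok/s")
```

Under these assumptions the SSD tier caps out well below one token per second before page-fault latency is even counted, which is the regime the analysis describes. Real numbers depend on the model architecture, quantization, and how much of each token's working set is already cached.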

– Official llama.cpp docs list HIP support for AMD GPUs, so the platform itself is not the blocker.
– GitHub guidance on memory usage notes that `mmap` is the default, while `--no-mmap` and `--mlock` change paging behavior rather than making SSD a fast inference tier.
– If the model exceeds available memory, disabling `mmap` can actually prevent it from loading at all, because the weights can no longer be paged in from the file on demand. It is a blunt reminder that “fit” and “run well” are different problems.
– A 170GB model on 48GB VRAM plus 64GB RAM is fundamentally constrained by storage and host-memory bandwidth, so token speed will crater before compute saturates.
– For this class of workload, a more aggressive quantization, a smaller model, more VRAM, or distributed inference are usually more realistic wins than disk offloading.
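The "fit" side of the problem is simple arithmetic on the figures quoted above. The sketch below greedily fills the fastest tier first and reports what is left for disk. It is an optimistic simplification: it ignores the KV cache, compute buffers, and operating-system overhead, and the function name is hypothetical rather than anything llama.cpp provides.

```python
def split_across_tiers(model_gb: float, vram_gb: float, ram_gb: float) -> dict:
    """Greedily place model weights in VRAM, then RAM, then disk.

    Optimistic: assumes every byte of VRAM and RAM is usable for weights.
    """
    in_vram = min(model_gb, vram_gb)
    remaining = model_gb - in_vram
    in_ram = min(remaining, ram_gb)
    on_disk = remaining - in_ram
    return {"vram": in_vram, "ram": in_ram, "disk": on_disk}


# Figures from the post: 170GB model, 48GB VRAM, 64GB system RAM.
split = split_across_tiers(model_gb=170, vram_gb=48, ram_gb=64)
disk_fraction = split["disk"] / 170

print(split)
print(f"share of weights left on disk: {disk_fraction:.0%}")
```

Even in this best case, 58GB of weights (about a third of the model) has no home in VRAM or RAM and must be served from storage, which is why shrinking the model or adding fast memory tends to pay off more than tuning paging flags.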

// TAGS

llama-cpp, llm, inference, gpu, open-source, self-hosted

DISCOVERED

70d ago

2026-03-19

PUBLISHED

70d ago

2026-03-19

RELEVANCE

7/10

AUTHOR

EmPips
