OPEN_SOURCE
REDDIT // 34d ago · INFRASTRUCTURE
Ollama users want smarter AMD offloading
A Reddit thread in r/LocalLLaMA is asking for an LLM server on AMD/ROCm that can keep multiple models resident by filling GPU VRAM first, then spilling overflow layers to CPU instead of fully evicting loaded models. The author says Ollama handles multi-model loading only while everything fits in VRAM, which makes mixed workloads like pairing a large reasoning model with a smaller background model awkward to manage.
// ANALYSIS
This is a real local-inference infrastructure gap: most "easy" model runners still behave like single-model launchers, not schedulers that can intelligently tier workloads across GPU and system RAM.
- The post describes a concrete limitation in Ollama today: once the next model no longer fits in VRAM, it unloads other models instead of partially offloading layers to CPU.
- That behavior is especially painful for agent-style setups where one heavyweight model handles reasoning while smaller models do extraction, summarization, or utility work in parallel.
- The AMD/ROCm angle matters because users often prioritize llama.cpp's stability on Radeon hardware, even when other servers may look stronger on paper.
- The only reply points to `llama.cpp` plus `llama-swap` over Vulkan as the most practical workaround, which suggests advanced users still need lower-level tooling for smarter residency control.
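The reply's workaround can be sketched as a `llama-swap` config that keeps both models resident, using `llama.cpp`'s `-ngl` (`--n-gpu-layers`) flag to split the big model between VRAM and system RAM. This is an illustrative sketch, not a tested setup: the model paths, layer counts, and group settings are assumptions, and field names follow `llama-swap`'s config format as best recalled.

```yaml
models:
  # Large reasoning model: offload only part of its layers to the GPU
  # (-ngl 40); the remaining layers run from system RAM instead of
  # forcing other models to be evicted. Path and count are illustrative.
  "big-reasoner":
    cmd: llama-server --port ${PORT} -m /models/reasoner-32b-q4.gguf -ngl 40
  # Small utility model: small enough to keep fully in VRAM
  # (-ngl 99 offloads all layers).
  "small-utility":
    cmd: llama-server --port ${PORT} -m /models/utility-3b-q8.gguf -ngl 99
groups:
  # With swapping disabled for the group, both members stay loaded
  # side by side rather than replacing each other per request.
  "resident":
    swap: false
    members: ["big-reasoner", "small-utility"]
```

The key trade-off is that partially offloaded layers run at CPU speed, so this buys residency for mixed workloads at the cost of per-token latency on the large model.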
// TAGS
ollama · llm · inference · gpu · self-hosted
DISCOVERED
2026-03-09
PUBLISHED
2026-03-09
RELEVANCE
7/10
AUTHOR
Di_Vante