OPEN_SOURCE
REDDIT // 3d ago · INFRASTRUCTURE
LM Studio, Ollama diverge on memory
A LocalLLaMA user reports that Ollama keeps a Gemma 4 run at 85K context almost entirely in GPU memory across a mixed Nvidia setup, while LM Studio steadily shifts work into system RAM and loses throughput over repeated prompts. The open question is whether LM Studio needs different offload settings or simply handles long-context, multi-GPU scheduling less cleanly.
// ANALYSIS
This looks less like a raw VRAM shortage and more like two runtimes making different memory-placement decisions under long-context pressure. Ollama’s current scheduler is explicitly tuned for tighter memory accounting and multi-GPU behavior, while LM Studio’s dedicated-GPU cap can intentionally spill the overflow into host RAM.
- LM Studio exposes GPU offload and context-length controls, so a strict "dedicated GPU memory" cap can leave the runtime room to push buffers or KV cache into system RAM
- Ollama's docs say to check `ollama ps` for the CPU/GPU split and note improved multi-GPU scheduling and memory reporting, which matches the user's steadier `nvidia-smi` readout
- At 85K–100K context, KV cache size becomes a first-order constraint, so small differences in how the runtime allocates cache and scratch space can cause big swings in RAM use and tok/s
- Mixed-card systems are a good stress test, but they also make allocator behavior look like a "bug" when it may just be a different tradeoff between host RAM spillover and GPU saturation
- If the goal is Ollama-like behavior, the likely knobs are max GPU offload, shorter context, and checking whether LM Studio is forcing a conservative offload policy for the selected engine
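To see why long context dominates memory placement, the KV cache can be sized with simple arithmetic: two tensors (keys and values) per layer, each shaped by KV heads, head dimension, and context length. The sketch below uses illustrative architecture numbers, not Gemma's actual parameters, and assumes an fp16 cache with no KV quantization.

```python
# Back-of-envelope KV-cache size for a long-context run.
# num_layers / kv_heads / head_dim below are ASSUMED illustrative
# values, not the real Gemma architecture.

def kv_cache_bytes(num_layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Keys + values: 2 tensors per layer, each [kv_heads, context_len, head_dim]."""
    return 2 * num_layers * kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical mid-size model at the user's 85K context, fp16 cache:
cache = kv_cache_bytes(num_layers=62, kv_heads=16, head_dim=128,
                       context_len=85_000)
print(f"KV cache ≈ {cache / 2**30:.1f} GiB")  # tens of GiB at this scale
```

At these assumed parameters the cache alone runs to tens of gigabytes, so a runtime that caps "dedicated GPU memory" even slightly lower than another will spill a large slice of it into host RAM, which matches the throughput gap the user describes.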
// TAGS
lm-studio · ollama · llm · gpu · inference · self-hosted
DISCOVERED
2026-04-08
PUBLISHED
2026-04-08
RELEVANCE
7/10
AUTHOR
pepedombo