OPEN_SOURCE
REDDIT // 3d ago · INFRASTRUCTURE
LM Studio, Ollama diverge on memory
A LocalLLaMA user reports that Ollama keeps a Gemma 4 run at 85K context almost entirely in GPU memory across a mixed Nvidia setup, while LM Studio steadily shifts work into system RAM and loses throughput over repeated prompts. The open question is whether LM Studio needs different offload settings or simply handles long-context, multi-GPU scheduling less cleanly.
// ANALYSIS
This looks less like a raw VRAM shortage and more like two runtimes making different memory-placement decisions under long-context pressure. Ollama’s current scheduler is explicitly tuned for tighter memory accounting and multi-GPU behavior, while LM Studio’s dedicated-GPU cap can intentionally spill the overflow into host RAM.
- LM Studio exposes GPU offload and context-length controls, so a strict "dedicated GPU memory" cap can leave the runtime room to push buffers or KV cache into system RAM
- Ollama's docs say to check `ollama ps` for the CPU/GPU split and note improved multi-GPU scheduling and memory reporting, which matches the user's steadier `nvidia-smi` readout
- At 85K–100K context, KV cache size becomes a first-order constraint, so small differences in how the runtime allocates cache and scratch space can cause big swings in RAM use and tok/s
- Mixed-card systems are a good stress test, but they also make allocator behavior look like a "bug" when it may just be a different tradeoff between host RAM spillover and GPU saturation
- If the goal is Ollama-like behavior, the likely knobs are max GPU offload, shorter context, and checking whether LM Studio is forcing a conservative offload policy for the selected engine
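To see why long context dominates memory placement, the KV cache can be sized with simple arithmetic: two tensors (keys and values) per layer, each shaped by KV heads, head dimension, and context length. The sketch below uses illustrative architecture numbers, not Gemma's actual parameters, and assumes an fp16 cache with no KV quantization.

```python
# Back-of-envelope KV-cache size for a long-context run.
# num_layers / kv_heads / head_dim below are ASSUMED illustrative
# values, not the real Gemma architecture.

def kv_cache_bytes(num_layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Keys + values: 2 tensors per layer, each [kv_heads, context_len, head_dim]."""
    return 2 * num_layers * kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical mid-size model at the user's 85K context, fp16 cache:
cache = kv_cache_bytes(num_layers=62, kv_heads=16, head_dim=128,
                       context_len=85_000)
print(f"KV cache ≈ {cache / 2**30:.1f} GiB")  # tens of GiB at this scale
```

At these assumed parameters the cache alone runs to tens of gigabytes, so a runtime that caps "dedicated GPU memory" even slightly lower than another will spill a large slice of it into host RAM, which matches the throughput gap the user describes.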
// TAGS
lm-studio · ollama · llm · gpu · inference · self-hosted
DISCOVERED
2026-04-08
PUBLISHED
2026-04-08
RELEVANCE
7/10
AUTHOR
pepedombo