Open WebUI RAG Eats VRAM, Model Hits RAM
OPEN_SOURCE
REDDIT · 7h ago · INFRASTRUCTURE


A user reports that routing LM Studio through Open WebUI changes memory behavior: the model ends up in system RAM while Open WebUI itself reserves GPU memory, leaving much of the 16 GB VRAM unused. The post points to a likely interaction between Open WebUI’s local RAG/embedding stack and the remote LM Studio server, rather than a simple model-size issue.
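A quick way to confirm which process is actually holding the VRAM is to ask the driver directly. The sketch below (an illustrative diagnostic, not from the post; the function names are mine) parses per-process GPU memory from `nvidia-smi`. If Open WebUI's embedding or STT workers show up alongside, or instead of, the LM Studio server process, the RAG stack is the likely consumer.

```python
# Diagnostic sketch: attribute VRAM to processes by parsing
# `nvidia-smi --query-compute-apps=... --format=csv,noheader`.
import subprocess

def parse_compute_apps(csv_text: str) -> list[dict]:
    """Parse nvidia-smi compute-apps CSV lines into dicts with MiB ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, name, mem = (field.strip() for field in line.split(","))
        rows.append({"pid": int(pid), "name": name,
                     "used_mib": int(mem.split()[0])})  # "2048 MiB" -> 2048
    return rows

def vram_by_process() -> list[dict]:
    """Return running compute processes sorted by VRAM use, largest first."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"], text=True)
    return sorted(parse_compute_apps(out), key=lambda r: -r["used_mib"])
```

Run `vram_by_process()` on the affected box: a Python process from the Open WebUI container sitting on ~2 GB while the LM Studio process is absent (or tiny) would match the reported symptom.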

// ANALYSIS

This reads less like “the model is too big” and more like Open WebUI is inserting its own local inference path into the request chain, which can steal GPU resources from the actual model server.

  • Open WebUI’s docs explicitly say it loads local ML models for RAG and STT, and recommend offloading embeddings to an external service to avoid RAM pressure.
  • The project also documents that OpenAI-compatible backends like LM Studio are supported, but that only covers the chat API contract, not how Open WebUI handles its own retrieval pipeline.
  • There’s prior community reporting that RAG can reserve 2 to 4 GB of VRAM inside Open WebUI, which matches the “2 GB of something else” symptom described here.
  • If disabling RAG truly changes nothing, the next likely culprit is the Open WebUI Docker image or embedding configuration still using a local CUDA-backed path.
  • Net: this looks like a real integration quirk, not user incompetence, and it’s exactly the kind of VRAM contention that makes self-hosted AI stacks feel brittle.
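The mitigation Open WebUI's docs point to can be sketched as environment configuration: switch the RAG embedding engine from its local default to an external OpenAI-compatible endpoint, here the same LM Studio server. The variable names follow Open WebUI's documented environment configuration, but the URL and model name below are placeholders for this particular setup, not values from the post.

```
# .env fragment for the Open WebUI container (placeholder values)
RAG_EMBEDDING_ENGINE=openai                        # default "" runs embeddings locally in-process
RAG_OPENAI_API_BASE_URL=http://host.docker.internal:1234/v1
RAG_EMBEDDING_MODEL=text-embedding-nomic-embed-text-v1.5
```

If VRAM usage inside the Open WebUI container drops after this change, the local embedding path was the culprit; if not, the remaining suspect is a CUDA-enabled image variant loading other local models (e.g. STT).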

// TAGS

open-webui · lm-studio · rag · inference · gpu · self-hosted · ai-infrastructure

DISCOVERED

7h ago

2026-04-18

PUBLISHED

8h ago

2026-04-18

RELEVANCE

8 / 10

AUTHOR

Dekatater