Ollama API Calls Lag Interactive Chats
OPEN_SOURCE
REDDIT // 25d ago // INFRASTRUCTURE


A Reddit user reports that the same Qwen 3.5 prompt runs 30-40x slower through Ollama’s API than in the interactive shell when generating JSON for a database workflow. The gap points more to request handling, model residency, or client setup than to a fundamental model-speed problem.

// ANALYSIS

The interesting part here is that this looks like a workflow bug, not a model-quality issue. Ollama is built for local, programmatic inference, and its docs make `keep_alive` a first-class knob, which strongly suggests the interactive session may be benefiting from a warmer, stickier model state than the API path.
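As a sketch of that knob (the model tag and duration values here are illustrative, not taken from the post), an API request body that keeps the model resident between calls might look like:

```python
import json

def build_generate_payload(model: str, prompt: str, keep_alive: str = "10m") -> dict:
    """Build an Ollama /api/generate request body that keeps the model resident.

    `keep_alive` tells the server how long to hold the model in memory after
    the call returns; without it, back-to-back API requests can pay a reload
    cost that a long-lived interactive shell session never sees.
    """
    return {
        "model": model,            # e.g. a local Qwen tag -- illustrative only
        "prompt": prompt,
        "format": "json",          # ask the server to constrain output to JSON
        "stream": False,
        "keep_alive": keep_alive,  # e.g. "10m"; Ollama also accepts -1 for "keep loaded"
    }

payload = build_generate_payload("qwen3.5", "Return a JSON object describing one row.")
body = json.dumps(payload)  # would be POSTed to the local Ollama endpoint
```

The point of the sketch is simply that model residency is an explicit, per-request setting on the API path, whereas the chat shell manages it for you.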

  • Interactive chat shells often keep context and the model resident longer, while API code can accidentally pay cold-start or reload costs on every request.
  • Enforced JSON output can add token overhead, especially if the app is sending large prompts or repeatedly recreating the client/session.
  • If the GUI queues entries and fires them serially, any per-call setup cost gets multiplied fast and can feel like a 30-40x slowdown.
  • The core lesson for local LLM apps: measure end-to-end latency separately from model generation speed, because plumbing overhead can dominate the actual inference.
  • For developers building automation around Ollama, this is a reminder to validate connection reuse, streaming behavior, and model retention before blaming the model.
// TAGS
ollama · api · inference · llm · self-hosted · automation

DISCOVERED

25d ago

2026-03-18

PUBLISHED

25d ago

2026-03-18

RELEVANCE

7/10

AUTHOR

shooteverywhere