OPEN_SOURCE
REDDIT · 25d ago · INFRASTRUCTURE
Ollama API Calls Lag Interactive Chats
A Reddit user reports that the same Qwen 3.5 prompt runs 30-40x slower through Ollama's API than in the interactive shell while generating JSON for a database workflow. The gap points to request handling, model residency, or client setup rather than a fundamental model-speed problem.
// ANALYSIS
The interesting part here is that this looks like a workflow bug, not a model-quality issue. Ollama is built for local, programmatic inference, and its docs make `keep_alive` a first-class knob, which strongly suggests the interactive session may be benefiting from a warmer, stickier model state than the API path.
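As an illustration, a minimal sketch of a request body for Ollama's `/api/generate` endpoint that pins the model in memory via `keep_alive` (the model tag and duration below are placeholders, not values from the thread):

```python
def build_generate_request(model: str, prompt: str, keep_alive: str = "10m") -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    keep_alive controls how long the model stays resident after the call:
    a duration like "10m", "-1" to keep it loaded indefinitely, or "0" to
    unload immediately. A too-short value forces a reload on every API call.
    """
    return {
        "model": model,          # placeholder model tag
        "prompt": prompt,
        "format": "json",        # request JSON-constrained output
        "stream": False,
        "keep_alive": keep_alive,
    }

# POST this dict as JSON to http://localhost:11434/api/generate
```

If the interactive shell holds the model resident between turns while the API client lets it unload, this one parameter can account for most of the gap.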
- Interactive chat shells often keep context and the model resident longer, while API code can accidentally pay cold-start or reload costs on every request.
- JSON-enforced parsing can add token overhead, especially if the app is sending large prompts or repeatedly recreating the client/session.
- If the GUI queues entries and fires them serially, any per-call setup cost gets multiplied fast and can feel like a 30-40x slowdown.
- The core lesson for local LLM apps: measure end-to-end latency separately from model generation speed, because plumbing overhead can dominate the actual inference.
- For developers building automation around Ollama, this is a reminder to validate connection reuse, streaming behavior, and model retention before blaming the model.
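One way to separate plumbing from generation is to read the timing fields Ollama returns in a non-streaming `/api/generate` response (durations are in nanoseconds); the sample values below are made up for illustration:

```python
def latency_breakdown(resp: dict) -> dict:
    """Split an Ollama response's total latency into model-load time,
    pure token generation, and everything-else overhead, in seconds."""
    ns = 1_000_000_000
    total = resp["total_duration"] / ns
    load = resp.get("load_duration", 0) / ns   # model (re)load cost
    gen = resp["eval_duration"] / ns           # pure token generation
    return {
        "total_s": total,
        "load_s": load,
        "generation_s": gen,
        "overhead_s": total - gen,             # load + prompt eval + plumbing
        "tokens_per_s": resp["eval_count"] / gen if gen else 0.0,
    }

# Made-up numbers: a 12 s call where only 2 s was actual generation.
sample = {
    "total_duration": 12_000_000_000,
    "load_duration": 9_000_000_000,
    "eval_duration": 2_000_000_000,
    "eval_count": 100,
}
print(latency_breakdown(sample))
```

A large `load_s` relative to `generation_s` points at cold starts rather than the model itself, which is exactly the distinction the thread is circling.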
// TAGS
ollama · api · inference · llm · self-hosted · automation
DISCOVERED
2026-03-18
PUBLISHED
2026-03-18
RELEVANCE
7/10
AUTHOR
shooteverywhere