Ollama vision pipelines hit throughput wall
OPEN_SOURCE · REDDIT · 24d ago · INFRASTRUCTURE

A Reddit user running Qwen3.5:9B through Ollama on an M3 Ultra and an RTX 5070 Ti says the setup only classifies 4-6 JPGs per minute, far short of the 10x speedup needed for a million-image backlog. They’re asking for better ways to structure the pipeline after Tesseract preprocessing produced garbage.
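One concrete way to attack the per-request overhead is to stop sending images serially. The sketch below fires classification requests at Ollama's HTTP API (`POST /api/generate` with base64-encoded `images`) from a thread pool so network and decode latency overlap. The model tag, prompt, and worker count are placeholders, not details from the post; true GPU-side parallelism also depends on Ollama's `OLLAMA_NUM_PARALLEL` setting.

```python
# Hypothetical sketch: concurrent classification calls against a local
# Ollama server. Model tag, prompt, and worker count are assumptions.
import base64
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(image_bytes: bytes, model: str = "qwen2.5vl:7b") -> dict:
    """Build a non-streaming classification request for one image."""
    return {
        "model": model,  # placeholder tag; substitute the model you run
        "prompt": "Classify this image as photo, email, or document. "
                  "Answer with one word.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # one JSON object back per image
    }

def classify(image_bytes: bytes) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(image_bytes)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip().lower()

def classify_many(images: list[bytes], workers: int = 4) -> list[str]:
    # Overlaps request latency; server-side concurrency is capped by
    # Ollama's OLLAMA_NUM_PARALLEL environment variable.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify, images))
```

Even with a single GPU, keeping several requests in flight hides the encode/transfer time that otherwise serializes the pipeline.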

// ANALYSIS

The blunt takeaway: this is an inference-pipeline problem more than a raw-hardware problem. If you want 10x, the win likely comes from smarter routing, batching, and using smaller specialists for first-pass filtering, not just bigger boxes.

  • 4-6 images per minute suggests per-request overhead and serial processing are crushing throughput
  • A cheap classifier-first pass can route only likely documents to OCR or a heavier VLM
  • Smaller vision models or quantized variants may be enough for photo/email/document triage
  • If the task is fixed-label classification, a fine-tuned CV model or OCR+rules stack may beat a general-purpose 9B VLM
  • The RTX 5070 Ti’s 16 GB VRAM and Ollama’s orchestration overhead both make batch scaling hard
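The classifier-first idea above can be sketched in a few lines: a cheap first-pass scorer splits the backlog so only likely documents reach the expensive VLM/OCR stage. All names here are hypothetical, and the stub stands in for a small fine-tuned CNN or quantized vision model.

```python
# Illustrative routing sketch: cheap triage before the heavy model.
from typing import Callable, Iterable

def route(
    items: Iterable[str],
    cheap_predict: Callable[[str], float],
    threshold: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Split items into (heavy_queue, skipped) by cheap document score."""
    heavy, skipped = [], []
    for item in items:
        (heavy if cheap_predict(item) >= threshold else skipped).append(item)
    return heavy, skipped

# Stub standing in for a small classifier; real scores would come from
# a fast CV model, not the filename.
def stub_doc_score(path: str) -> float:
    return 0.9 if path.endswith("_scan.jpg") else 0.1

heavy, skipped = route(["a_scan.jpg", "b.jpg", "c_scan.jpg"], stub_doc_score)
# heavy == ["a_scan.jpg", "c_scan.jpg"]; skipped == ["b.jpg"]
```

If even a third of a million-image backlog can be skipped or fast-pathed this way, the effective throughput gain compounds with any batching wins.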
// TAGS

ollama · llm · multimodal · inference · gpu · self-hosted · open-source

DISCOVERED: 24d ago (2026-03-18)

PUBLISHED: 24d ago (2026-03-18)

RELEVANCE: 8/10

AUTHOR: Turbulent-Week1136