OPEN_SOURCE
REDDIT // 24d ago · INFRASTRUCTURE
Ollama vision pipelines hit throughput wall
A Reddit user running Qwen3.5:9B through Ollama on an M3 Ultra and an RTX 5070 Ti says the setup only classifies 4-6 JPGs per minute, far short of the 10x speedup needed for a million-image backlog. They’re asking for better ways to structure the pipeline after Tesseract preprocessing produced garbage.
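The scale of the gap is worth making explicit. A quick back-of-envelope check, using the post's numbers (the per-minute midpoint and timing assumptions are mine):

```python
# Back-of-envelope: how long a million-image backlog takes at the
# reported rate versus the 10x target. Rate is the midpoint of the
# user's reported 4-6 images/min; assumes continuous 24/7 operation.
backlog = 1_000_000            # images
rate_now = 5                   # images per minute (assumed midpoint)

days_now = backlog / rate_now / 60 / 24
days_10x = days_now / 10

print(f"~{days_now:.0f} days at current rate, ~{days_10x:.0f} days at 10x")
```

At roughly 139 days of nonstop processing, even a 10x speedup still means about two weeks of runtime, which is why the analysis below focuses on restructuring the pipeline rather than marginal tuning.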
// ANALYSIS
The blunt takeaway: this is an inference-pipeline problem more than a raw-hardware problem. If you want 10x, the win likely comes from smarter routing, batching, and using smaller specialists for first-pass filtering, not just bigger boxes.
- 4-6 images per minute suggests per-request overhead and serial processing are crushing throughput
- A cheap classifier-first pass can route only likely documents to OCR or a heavier VLM
- Smaller vision models or quantized variants may be enough for photo/email/document triage
- If the task is fixed-label classification, a fine-tuned CV model or OCR+rules stack may beat a general-purpose 9B VLM
- The RTX 5070 Ti's 16 GB VRAM and Ollama's orchestration overhead both make batch scaling hard
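The classifier-first routing described above can be sketched as a two-stage worker pool. This is an illustrative skeleton, not the poster's code: `classify_cheap` stands in for a call to a small or quantized vision model (e.g. an HTTP request to a local inference server), and `process_heavy` stands in for the expensive OCR/9B-VLM step; both stubs here are placeholders keyed off the filename.

```python
# Hypothetical two-stage triage pipeline: a cheap first-pass classifier
# routes only likely documents to the heavy extraction stage, and a
# thread pool keeps requests in flight instead of processing serially.
from concurrent.futures import ThreadPoolExecutor

def classify_cheap(path: str) -> str:
    # Placeholder for a small-model call that returns a one-word label.
    # Here we fake the label from the filename for illustration.
    return "document" if "doc" in path else "photo"

def process_heavy(path: str) -> str:
    # Placeholder for the expensive OCR / large-VLM extraction step.
    return f"extracted:{path}"

def triage(paths: list[str], workers: int = 8):
    """Label everything cheaply, then run extraction only on documents."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = dict(zip(paths, pool.map(classify_cheap, paths)))
        docs = [p for p, lbl in labels.items() if lbl == "document"]
        extracted = dict(zip(docs, pool.map(process_heavy, docs)))
    return labels, extracted
```

If most of the backlog is photos, the heavy model only ever sees the small document slice, which is where the bulk of a 10x gain would plausibly come from; the thread pool is a stand-in for whatever concurrency the serving layer actually supports.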
// TAGS
ollama · llm · multimodal · inference · gpu · self-hosted · open-source
DISCOVERED
2026-03-18
PUBLISHED
2026-03-18
RELEVANCE
8/10
AUTHOR
Turbulent-Week1136