
Cull open-sources image dataset curation pipeline
Cull is an open-source, single-machine workflow for building and cleaning AI image datasets. It pulls images and source prompts from many scrapers, deduplicates locally, classifies images with configurable vision workers and a strict JSON schema, then sorts keepers into category folders with prompt files and audit records. The project targets LoRA prep, large-scale finetune dataset curation, and prompt-less archives that need auto-captioning.
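The summary does not say how Cull detects duplicates, so the following is only a minimal sketch of one common local approach: exact-duplicate detection by hashing file bytes. The function names `file_digest` and `find_duplicates` are hypothetical and are not taken from the project. A real pipeline may well use perceptual hashing to catch near-duplicates too.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(folder: Path) -> Dict[Path, Path]:
    """Map each byte-identical duplicate to the first file seen with the same content.

    Hypothetical helper: illustrates exact-match dedup only, not Cull's own method.
    """
    first_seen: Dict[str, Path] = {}
    duplicates: Dict[Path, Path] = {}
    # Sorting makes the choice of which copy counts as the original deterministic.
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        digest = file_digest(path)
        if digest in first_seen:
            duplicates[path] = first_seen[digest]
        else:
            first_seen[digest] = path
    return duplicates
```

Exact hashing is cheap and never flags two different files as duplicates, but it misses resized or re-encoded copies, which is the usual reason to add a perceptual hash as a second pass.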
A strong release for anyone doing image-model data work locally: it combines collection, triage, captioning, and export in one tool, so there is no need to stitch a stack together from separate pieces.
- The scope is unusually practical: scraping, dedup, classification, captioning, and export are all in one loop.
- The pluggable vision-worker design is the real differentiator, especially if you want local models, LM Studio, Groq, or other OpenAI-compatible backends.
- The strict schema and audit outputs should reduce the usual “LLM said something vaguely useful” problem.
- Best fit is niche but real: people curating LoRA datasets, reference libraries, or messy archives on a single machine.
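To make the strict-schema and audit points concrete, here is a minimal sketch of how a vision worker's reply could be validated and a keeper filed into a category folder with a prompt file and an audit line. Everything named here is an assumption for illustration: the fields `keep`, `category`, and `caption`, the category labels, the `audit.jsonl` file name, and the functions `parse_verdict` and `file_keeper` are not taken from Cull.

```python
import json
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"keep": bool, "category": str, "caption": str}
ALLOWED_CATEGORIES = {"portrait", "landscape", "other"}  # assumed labels


def parse_verdict(raw: str) -> dict:
    """Parse a worker reply and reject anything that deviates from SCHEMA."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != set(SCHEMA):
        raise ValueError(f"missing or unexpected keys: {sorted(set(data) ^ set(SCHEMA))}")
    for key, expected in SCHEMA.items():
        # Exact type check, so the string "true" is not accepted for a bool.
        if type(data[key]) is not expected:
            raise ValueError(f"{key!r} must be {expected.__name__}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    return data


def file_keeper(image: Path, verdict: dict, out_root: Path) -> Optional[Path]:
    """Append an audit line for every image; copy keepers next to a prompt file."""
    out_root.mkdir(parents=True, exist_ok=True)
    with (out_root / "audit.jsonl").open("a", encoding="utf-8") as log:
        log.write(json.dumps({"image": image.name, **verdict}) + "\n")
    if not verdict["keep"]:
        return None
    dest_dir = out_root / verdict["category"]
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / image.name
    shutil.copy2(image, dest)
    # The caption is stored as a sidecar .txt file with the same stem as the image.
    dest.with_suffix(".txt").write_text(verdict["caption"], encoding="utf-8")
    return dest
```

Rejecting a malformed reply outright, rather than trying to repair it, means every record in the audit log has the same shape and can be checked later.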
DISCOVERED: 2026-05-11
PUBLISHED: 2026-05-10
AUTHOR: Compunerd3