OPEN_SOURCE
REDDIT · 37d ago · TUTORIAL
Schema-aware chunking beats flattening for JSON RAG
A Reddit thread in r/LocalLLaMA asks how to chunk 25k-row, moderately nested JSON files for retrieval in Chroma when schemas vary widely and some fields are too large to pass through directly. The discussion centers on whether key-wise chunking, flattening, or LangChain-style JSON parsing is the better fit for messy structured data at RAG scale.
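The key-wise approach the thread converges on can be sketched as a small preprocessing step: split each record into field-group chunks that carry the parent record's id as metadata, so each chunk maps directly onto Chroma's `collection.add(ids=..., documents=..., metadatas=...)`. The `chunk_record` helper and the `groups` mapping below are hypothetical illustrations, not code from the thread.

```python
import json

def chunk_record(record: dict, record_id: str, groups: dict) -> list:
    """Split one nested JSON record into per-field-group chunks.

    `groups` maps a chunk name to the top-level keys it covers; each
    chunk keeps the parent record id as metadata so retrieval results
    can be traced back to the source record. (Hypothetical helper.)
    """
    chunks = []
    for group_name, keys in groups.items():
        payload = {k: record[k] for k in keys if k in record}
        if not payload:
            continue  # skip groups absent from this record's schema
        chunks.append({
            "id": f"{record_id}:{group_name}",
            "document": json.dumps(payload, ensure_ascii=False),
            "metadata": {"parent_id": record_id, "group": group_name},
        })
    return chunks

# One moderately nested record becomes three retrievable slices:
record = {
    "title": "Widget",
    "specs": {"weight_g": 120, "color": "red"},
    "reviews": [{"stars": 5, "text": "great"}],
}
chunks = chunk_record(
    record, "prod-001",
    {"summary": ["title"], "specs": ["specs"], "reviews": ["reviews"]},
)
```

Each dict in `chunks` supplies one entry for `ids`, `documents`, and `metadatas` in a Chroma `collection.add` call, and the `parent_id` metadata supports filtered queries back to the full record.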
// ANALYSIS
The interesting part here is not Chroma itself but the ingestion design problem: naive flattening usually throws away structure, while schema-aware chunking preserves the relationships retrieval actually needs.
- Chroma is built for embeddings plus metadata filtering, so JSON records usually work best when transformed into semantically meaningful slices instead of dumped in raw
- The poster's success with key-wise chunking matches a common RAG pattern: chunk by logical entity or field group, not by arbitrary token windows
- Flattening can help for normalization, but doing it blindly on heterogeneous schemas often destroys parent-child context and makes retrieval noisier
- Varying schemas across many files point toward a preprocessing layer that maps each source into a consistent intermediate shape before embedding
- For AI developers, this is a practical reminder that retrieval quality is often won or lost in data modeling long before vector search enters the loop
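The preprocessing layer mentioned above can be a simple per-source mapping that collapses heterogeneous schemas into one intermediate shape before embedding. The `normalize` function and dot-path mappings below are an assumed illustration of that pattern, not something specified in the thread.

```python
def normalize(raw: dict, mapping: dict) -> dict:
    """Map one source-specific record into a shared intermediate shape.

    `mapping` pairs canonical field names with per-source key paths
    (dot-separated for nested keys). Missing paths yield None rather
    than raising, since schemas vary across files. (Hypothetical.)
    """
    def get(d, path):
        for part in path.split("."):
            if not isinstance(d, dict) or part not in d:
                return None
            d = d[part]
        return d

    return {canon: get(raw, path) for canon, path in mapping.items()}

# Two sources with different schemas collapse to the same shape:
a = normalize({"product_name": "Widget", "meta": {"sku": "W1"}},
              {"name": "product_name", "sku": "meta.sku"})
b = normalize({"title": "Widget", "id": "W1"},
              {"name": "title", "sku": "id"})
```

Once every source emits the same canonical fields, a single chunking and embedding pipeline can run downstream without per-source branches.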
// TAGS
chroma · vector-db · rag · data-tools · langchain
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
RELEVANCE
6/10
AUTHOR
jay_solanki