OPEN_SOURCE
REDDIT // 10d ago // INFRASTRUCTURE
Legal RAG corpus maps 529K sections
A solo builder scraped 50 state legislature sites, normalized 529K statutory sections, and linked them with 487K citation and cross-reference edges. The result is a legal retrieval stack that combines BM25, dense search, and graph traversal, then exposes it through an MCP server for LLM clients.
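The post does not say how the three retrievers are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption and not as the author's method. The function name, the constant `k=60`, and the section IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of section IDs into a single ranking.

    Each list contributes 1 / (k + rank) per item, so a section that
    ranks well in more than one retriever rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, section_id in enumerate(ranking, start=1):
            scores[section_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda s: (-scores[s], s))


# Hypothetical result lists from the three retrievers (IDs are made up).
bm25_hits = ["CA-PEN-187", "CA-PEN-188", "TX-PEN-19.02"]
dense_hits = ["TX-PEN-19.02", "CA-PEN-187", "NY-PEN-125.25"]
graph_hits = ["CA-PEN-187", "CA-PEN-189"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits, graph_hits])
print(fused)
```

RRF uses only rank positions, so BM25 scores, cosine similarities, and graph scores never have to be put on a common scale.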
// ANALYSIS
This reads like a real-world proof that legal RAG is still mostly a retrieval-engineering problem, not an embeddings problem. The graph layer is the interesting part: once you have clean citations and cross-references, PageRank-style ranking over that structure can surface relevant provisions that semantic search alone misses.
- BM25 matters here because legal queries often hinge on exact section numbers, defined terms, and statutory phrasing
- Dense retrieval still adds value for cross-jurisdiction similarity, where wording diverges but legal function is the same
- Citation graphs plus PageRank are the differentiator because they encode how statutes actually relate, not just how they sound
- The data work is the hard moat: 50 scraper variants, normalization quirks, and edge resolution are where most teams would stall
- MCP exposure makes the corpus immediately useful to agents, which is a cleaner distribution story than shipping yet another search UI
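To make the citation-graph point concrete, here is a minimal sketch of personalized PageRank by power iteration. It assumes the graph is a list of `(citing, cited)` edges and that the seed set comes from an earlier retrieval step. The function name, parameters, and section IDs are hypothetical, and the post does not confirm that the project uses this exact variant.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Score sections by random walks that restart at the seed sections."""
    seeds = set(seeds)
    nodes = set(seeds)
    out_links = {}
    for src, dst in edges:
        nodes.update((src, dst))
        out_links.setdefault(src, []).append(dst)

    # Restart distribution: uniform over the seed sections only.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            targets = out_links.get(n)
            if targets:
                # Split this node's mass evenly across the sections it cites.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling section (cites nothing): send its mass back to seeds.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank


# Hypothetical citation edges: (citing section, cited section).
edges = [
    ("CA-PEN-187", "CA-PEN-188"),
    ("CA-PEN-187", "CA-PEN-189"),
    ("CA-PEN-188", "CA-PEN-189"),
    ("CA-PEN-190", "CA-PEN-187"),
]
scores = personalized_pagerank(edges, seeds=["CA-PEN-187"])
for section, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{section}: {score:.3f}")
```

Because the walk restarts at the seeds, sections they cite directly or transitively pick up score even when they share no vocabulary with the query. A section that only cites into the seed set, such as `CA-PEN-190` here, stays at zero.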
// TAGS
legal-rag-corpus, rag, search, embedding, mcp, data-tools, agent
DISCOVERED
10d ago
2026-04-01
PUBLISHED
11d ago
2026-04-01
RELEVANCE
8/10
AUTHOR
Low-Medium-4320