AST graphs, BM25 cut code RAG tokens
The post describes a code retrieval pipeline that parses repositories with Tree-sitter, turns symbols and dependencies into a typed graph, and uses BM25 over node metadata plus edge traversal to keep LLM context around 5K tokens. It is a practical, lexical-first alternative to chunk-and-embed retrieval for codebases, but the benchmark claim is still anecdotal.
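The pipeline above can be sketched end to end. The post uses Tree-sitter; Python's stdlib `ast` module stands in here to keep the sketch dependency-free. The toy source, node kinds, and field names are illustrative assumptions, not the post's actual schema.

```python
import ast

# Toy module to extract a symbol graph from (assumption, not from the post).
SOURCE = '''
import os

class Loader:
    def read(self, path):
        return os.stat(path)

def load_all(loader, paths):
    return [loader.read(p) for p in paths]
'''

def build_graph(source, module="example"):
    """Turn symbols into typed nodes and dependencies into typed edges."""
    tree = ast.parse(source)
    nodes, edges = {}, []
    for item in ast.walk(tree):
        if isinstance(item, ast.FunctionDef):
            nodes[item.name] = {"kind": "function", "module": module,
                                "signature": [a.arg for a in item.args.args]}
            # Call edges: record what each function invokes (Name or Attribute).
            for sub in ast.walk(item):
                if isinstance(sub, ast.Call):
                    target = getattr(sub.func, "id",
                                     getattr(sub.func, "attr", None))
                    if target:
                        edges.append((item.name, target, "calls"))
        elif isinstance(item, ast.ClassDef):
            nodes[item.name] = {"kind": "class", "module": module}
        elif isinstance(item, ast.Import):
            for alias in item.names:
                edges.append((module, alias.name, "imports"))
    return nodes, edges

nodes, edges = build_graph(SOURCE)
```

A Tree-sitter version would walk the same way over its concrete syntax tree, with the advantage of covering many languages from one harness.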
The idea is solid because code search is usually about exact identifiers, signatures, and dependency edges more than fuzzy semantic similarity. The weak spot is that BM25-only retrieval can miss intent-heavy queries, so the real question is whether this beats a hybrid stack once you measure recall and answer quality.
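A minimal BM25 over flattened node metadata shows why identifier-heavy queries score well lexically. The tokenizer splits snake_case so `parse_config` also matches `parse` and `config`; the field layout and parameters (k1=1.2, b=0.75 are the common defaults) are assumptions, not taken from the post.

```python
import math
from collections import Counter

def tokenize(text):
    # Split on whitespace/dots; additionally split snake_case identifiers
    # so partial-symbol queries still match (an assumed convention).
    out = []
    for raw in text.replace(".", " ").split():
        out.append(raw.lower())
        if "_" in raw:
            out.extend(p.lower() for p in raw.split("_"))
    return out

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    corpus = [tokenize(d) for d in docs]
    n = len(corpus)
    avgdl = sum(len(c) for c in corpus) / n
    scores = []
    for toks in corpus:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in corpus if term in c)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (
                freq + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])

# Hypothetical flattened node metadata, mimicking typed-graph fields.
docs = [
    "function load_all module loaders signature loader paths",
    "class Loader module loaders",
    "function parse_config module config signature path",
]
ranking = bm25_rank("parse_config", docs)
```

An intent-heavy query like "where do we retry failed requests" has no exact identifiers to match, which is precisely the gap a hybrid vector stack would cover.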
- Tree-sitter plus typed nodes gives you structural recall that plain chunk embeddings routinely miss in codebases.
- BM25 is a strong default for symbol-heavy queries like function names, file paths, imports, and type references.
- The 5K-vs-100K token story is compelling, but it needs hard numbers: recall@k, task success rate, latency, and cost versus hybrid vector plus reranker baselines.
- Unweighted edges are likely the first bottleneck; call, import, and inheritance links should not all expand context equally.
- Static graph extraction will work best in typed or convention-heavy codebases; dynamic languages, reflection, and runtime dispatch will be the stress cases.
Discovered: 2026-04-30 · Published: 2026-04-30 · Author: Altruistic_Night_327