AST graphs, BM25 cut code RAG tokens
The post describes a code retrieval pipeline that parses repositories with Tree-sitter, turns symbols and dependencies into a typed graph, and uses BM25 over node metadata plus edge traversal to keep LLM context around 5K tokens. It is a practical, lexical-first alternative to chunk-and-embed retrieval for codebases, but the benchmark claim is still anecdotal.
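The pipeline above can be sketched end to end. The post uses Tree-sitter; Python's stdlib `ast` module stands in here to keep the sketch dependency-free. The toy source, node kinds, and field names are illustrative assumptions, not the post's actual schema.

```python
import ast

# Toy module to extract a symbol graph from (assumption, not from the post).
SOURCE = '''
import os

class Loader:
    def read(self, path):
        return os.stat(path)

def load_all(loader, paths):
    return [loader.read(p) for p in paths]
'''

def build_graph(source, module="example"):
    """Turn symbols into typed nodes and dependencies into typed edges."""
    tree = ast.parse(source)
    nodes, edges = {}, []
    for item in ast.walk(tree):
        if isinstance(item, ast.FunctionDef):
            nodes[item.name] = {"kind": "function", "module": module,
                                "signature": [a.arg for a in item.args.args]}
            # Call edges: record what each function invokes (Name or Attribute).
            for sub in ast.walk(item):
                if isinstance(sub, ast.Call):
                    target = getattr(sub.func, "id",
                                     getattr(sub.func, "attr", None))
                    if target:
                        edges.append((item.name, target, "calls"))
        elif isinstance(item, ast.ClassDef):
            nodes[item.name] = {"kind": "class", "module": module}
        elif isinstance(item, ast.Import):
            for alias in item.names:
                edges.append((module, alias.name, "imports"))
    return nodes, edges

nodes, edges = build_graph(SOURCE)
```

A Tree-sitter version would walk the same way over its concrete syntax tree, with the advantage of covering many languages from one harness.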
The idea is solid because code search is usually about exact identifiers, signatures, and dependency edges more than fuzzy semantic similarity. The weak spot is that BM25-only retrieval can miss intent-heavy queries, so the real question is whether this beats a hybrid stack once you measure recall and answer quality.
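A minimal BM25 over flattened node metadata shows why identifier-heavy queries score well lexically. The tokenizer splits snake_case so `parse_config` also matches `parse` and `config`; the field layout and parameters (k1=1.2, b=0.75 are the common defaults) are assumptions, not taken from the post.

```python
import math
from collections import Counter

def tokenize(text):
    # Split on whitespace/dots; additionally split snake_case identifiers
    # so partial-symbol queries still match (an assumed convention).
    out = []
    for raw in text.replace(".", " ").split():
        out.append(raw.lower())
        if "_" in raw:
            out.extend(p.lower() for p in raw.split("_"))
    return out

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    corpus = [tokenize(d) for d in docs]
    n = len(corpus)
    avgdl = sum(len(c) for c in corpus) / n
    scores = []
    for toks in corpus:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in corpus if term in c)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (
                freq + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])

# Hypothetical flattened node metadata, mimicking typed-graph fields.
docs = [
    "function load_all module loaders signature loader paths",
    "class Loader module loaders",
    "function parse_config module config signature path",
]
ranking = bm25_rank("parse_config", docs)
```

An intent-heavy query like "where do we retry failed requests" has no exact identifiers to match, which is precisely the gap a hybrid vector stack would cover.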
- Tree-sitter plus typed nodes gives you structural recall that plain chunk embeddings routinely miss in codebases.
- BM25 is a strong default for symbol-heavy queries like function names, file paths, imports, and type references.
- The 5K-vs-100K token story is compelling, but it needs hard numbers: recall@k, task success rate, latency, and cost versus hybrid vector plus reranker baselines.
- Unweighted edges are likely the first bottleneck; call, import, and inheritance links should not all expand context equally.
- Static graph extraction will work best in typed or convention-heavy codebases; dynamic languages, reflection, and runtime dispatch will be the stress cases.
Discovered: 2026-04-30 · Published: 2026-04-30 · Author: Altruistic_Night_327