RAG retrieval quality faces production test
A r/LocalLLaMA discussion digs into how teams measure whether retrieved chunks are actually relevant before they reach the prompt. The practical consensus leans toward layered evaluation: offline golden sets for ground truth, LLM judges for scale, and user behavior signals to catch misses in production.
The key lesson is that retrieval quality is a systems problem, not a single score. Teams that get real value combine labeled evals, online monitoring, and retrieval-stack fixes instead of treating embedding similarity as the answer.
- –Build a real golden set from production queries and score recall@k and MRR against it.
- –Use LLM-as-judge for scalable relevance checks, but calibrate it to human labels and keep the rubric simple.
- –Hybrid search, query expansion, and metadata filters often outperform more embedding tuning.
- –Watch reformulations, follow-up questions, and thumbs down as the best production canaries.
DISCOVERED
83d ago
2026-03-18
PUBLISHED
84d ago
2026-03-18
RELEVANCE
AUTHOR
Kapil_Soni