Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \emph{semantic stratification}, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.
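To make the contrast between aggregate and stratified evaluation concrete, the following is a minimal sketch rather than the estimator used in the paper. It assumes each query has already been assigned to a stratum (for example, an entity-based cluster) and that each stratum has a known corpus share; the function names `pooled_mean` and `stratified_mean`, the stratum labels, and the weights are all hypothetical. It computes a standard post-stratified mean of a per-query metric and reports the strata that have no queries, which is one simple way to surface a coverage gap.

```python
from collections import defaultdict


def pooled_mean(scores):
    """Plain aggregate metric: the unweighted mean over all queries."""
    return sum(scores) / len(scores)


def stratified_mean(scores, strata, weights):
    """Post-stratified mean of a per-query retrieval metric.

    scores  : per-query metric values (e.g. recall@k), one per query
    strata  : stratum label of each query (hypothetical cluster ids)
    weights : dict mapping stratum label -> assumed corpus share
    Returns (estimate, missing), where `missing` lists the strata
    that have a weight but no evaluation queries.
    """
    by_stratum = defaultdict(list)
    for score, label in zip(scores, strata):
        by_stratum[label].append(score)

    # Strata with no queries are coverage gaps; report them explicitly.
    missing = sorted(label for label in weights if label not in by_stratum)

    # Renormalize the weights over the strata that are actually covered.
    covered = {l: w for l, w in weights.items() if l in by_stratum}
    total = sum(covered.values())
    if total == 0:
        raise ValueError("no stratum with a weight has any queries")

    estimate = sum(
        w * sum(by_stratum[l]) / len(by_stratum[l]) for l, w in covered.items()
    ) / total
    return estimate, missing


if __name__ == "__main__":
    # Toy data: stratum "a" is over-sampled, "c" has no queries at all.
    scores = [1.0, 1.0, 1.0, 0.0]
    strata = ["a", "a", "a", "b"]
    weights = {"a": 0.5, "b": 0.4, "c": 0.1}
    print(pooled_mean(scores))
    print(stratified_mean(scores, strata, weights))
```

In this toy setup the pooled mean is 0.75 because the easy stratum dominates the query set, while the weighted estimate over the covered strata is about 0.56, and stratum `c` is flagged as having no queries. Renormalizing over covered strata is a design choice made here for simplicity; generating queries for the missing strata, as the abstract proposes, removes the need for it.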