The quality of answers generated by large language models (LLMs) in retrieval-augmented generation (RAG) depends strongly on the contextual information contained in the retrieved documents. A key challenge for improving RAG is to predict both the utility of retrieved documents (quantified as the performance gain from generating with context over generating without it) and the quality of the final answers in terms of correctness and relevance. In this paper, we define two prediction tasks within RAG. The first is retrieval performance prediction (RPP), which estimates the utility of retrieved documents. The second is generation performance prediction (GPP), which estimates the final answer quality. We hypothesise that in RAG, the topical relevance of retrieved documents correlates with their utility, suggesting that query performance prediction (QPP) approaches can be adapted for RPP and GPP. Beyond these retriever-centric signals, we argue that reader-centric features, such as the LLM's perplexity of the retrieved context conditioned on the input query, can further improve prediction accuracy for both RPP and GPP. Finally, we propose that features reflecting query-agnostic document quality and readability can also provide useful signals for both tasks. We train linear regression models on predictors from these three categories for both RPP and GPP. Experiments on the Natural Questions (NQ) dataset show that combining predictors from multiple feature categories yields the most accurate estimates of RAG performance.
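The setup described above can be illustrated with a minimal sketch: the RPP target is the gain from generating with context over generating without it, and a linear regression maps predictor features to that target. Everything concrete here is an assumption made for illustration and is not taken from the paper: the feature names (`qpp_score`, `context_perplexity`, `readability`), the toy feature values, the weights used to synthesise targets, and the plain least-squares solver.

```python
from typing import List, Sequence


def utility(score_with_context: float, score_without_context: float) -> float:
    """RPP target: performance gain from using the retrieved context."""
    return score_with_context - score_without_context


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def fit_linear_regression(
    features: Sequence[Sequence[float]], targets: Sequence[float]
) -> List[float]:
    """Ordinary least squares via the normal equations; the bias is the last weight."""
    rows = [list(f) + [1.0] for f in features]
    dim = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(dim)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], feature_row: Sequence[float]) -> float:
    """Linear prediction: dot product with the feature weights plus the bias."""
    return sum(w * x for w, x in zip(weights, feature_row)) + weights[-1]


# Hypothetical feature rows: [qpp_score, context_perplexity, readability].
features = [
    [0.9, 1.2, 0.5],
    [0.4, 1.5, 0.7],
    [0.7, 3.0, 0.2],
    [0.2, 2.1, 0.9],
    [0.6, 2.6, 0.4],
    [0.1, 1.1, 0.3],
]

# Arbitrary weights used only to synthesise noiseless toy targets (not results).
TOY_WEIGHTS = [0.5, -0.1, 0.2, 0.05]
targets = [predict(TOY_WEIGHTS, row) for row in features]

weights = fit_linear_regression(features, targets)
```

Because the toy targets are generated without noise, the fitted weights should match `TOY_WEIGHTS` up to floating-point error. With real data, the targets would instead be measured utility values (for RPP) or answer-quality scores (for GPP), and the fit would only approximate them.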