Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users examine documents sequentially, with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Moreover, traditional IR metrics do not account for documents that are topically related yet irrelevant, which actively degrade generation quality rather than merely being ignored. Because of these two major misalignments, namely human vs. machine positional discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric that uses an LLM-oriented positional discount to directly optimize correlation with end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves this correlation by up to 36% over traditional metrics. Our work is a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
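To make the idea concrete, below is a minimal sketch of a UDCG-style computation. The signed utility labels, the discount parameter `decay`, and the helper `udcg` are illustrative assumptions for this sketch, not the paper's exact definition.

```python
from typing import Sequence

def udcg(utilities: Sequence[float], decay: float = 0.0) -> float:
    """Illustrative UDCG-style score for a ranked list of passages.

    `utilities` holds signed, LLM-oriented utility labels: positive values
    for passages that help the generator answer, negative values for
    distracting passages that hurt it. `decay` controls a hypothetical
    LLM-oriented positional discount; decay=0.0 yields a flat discount,
    reflecting the observation that an LLM attends to all retrieved
    passages rather than scanning them top-down like a human.
    """
    return sum(u / (1.0 + decay * rank) for rank, u in enumerate(utilities))

# Example: two useful passages (+1.0, +0.5) and one distractor (-0.7).
# With a flat discount, the distractor's penalty is not attenuated by rank.
print(round(udcg([1.0, -0.7, 0.5]), 3))  # 0.8
```

In this sketch, a conventional DCG would ignore the distractor (treating it as merely non-relevant), whereas the signed utility makes its negative contribution explicit; `decay` could then be tuned against end-to-end answer accuracy, in the spirit of the optimization the abstract describes.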