Redefining Retrieval Evaluation in the Era of LLMs

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users examine documents sequentially, paying less attention to lower ranks. This assumption breaks down in Retrieval-Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality rather than merely being ignored. Because of these two misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
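
To make the two misalignments concrete, the following is a minimal sketch, not the UDCG formula, which the abstract does not state. It contrasts the classical log-discounted nDCG with a hypothetical score that averages signed per-passage utilities and applies no positional discount. The function `signed_utility_score`, the utility values, and the flat discount are all assumptions chosen for illustration only.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Classical DCG: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float]) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def signed_utility_score(utilities: Sequence[float]) -> float:
    """Hypothetical illustration, NOT the paper's UDCG.

    Each passage carries a signed utility: positive if it helps the LLM
    answer, negative if it distracts. The score is the plain mean, i.e. a
    flat positional discount (an assumption made for this sketch).
    """
    if not utilities:
        return 0.0
    return sum(utilities) / len(utilities)


# Two ranked lists with the same binary relevance labels (assumed values).
relevance = [1, 0, 0]

# Assumed utilities: in the second list, rank 2 is a distracting passage.
clean_utilities = [1.0, 0.0, 0.0]
distracting_utilities = [1.0, -0.5, 0.0]

if __name__ == "__main__":
    print("nDCG (both lists):", ndcg(relevance))
    print("signed score, clean list:", signed_utility_score(clean_utilities))
    print("signed score, with distractor:", signed_utility_score(distracting_utilities))
```

Under these assumed values, nDCG gives both lists the same score because a distractor and a harmless non-relevant passage both carry zero gain, while the signed score is lower for the list containing the distractor. This is the kind of gap the abstract attributes to classical metrics.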
