Are Large Language Models Good at Utility Judgments?

Retrieval-augmented generation (RAG) is considered a promising approach to alleviating the hallucination issue of large language models (LLMs), and it has recently received widespread attention from researchers. Because retrieval models are limited in their semantic understanding, the success of RAG relies heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering. In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain QA. Specifically, we introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our study has four parts: (i) our experiments reveal that well-instructed LLMs can distinguish between relevance and utility, and that LLMs are highly receptive to newly generated counterfactual passages; (ii) we scrutinize key factors in the instruction design that affect utility judgments; (iii) to verify the efficacy of utility judgments in practical retrieval augmentation applications, we examine LLMs' QA capabilities when using evidence judged to have utility and when using direct dense retrieval results; and (iv) we propose a k-sampling, listwise approach to reduce the dependency of LLMs on the order of input passages, thereby facilitating subsequent answer generation. We believe that the way we formalize and study the problem, along with our findings, contributes to a critical assessment of retrieval-augmented LLMs. Our code and benchmark can be found at \url{https://github.com/ict-bigdatalab/utility_judgments}.
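To make the idea of a k-sampling, listwise utility judgment more concrete, the following is a minimal Python sketch, not the released implementation. It assumes that "k-sampling" means presenting the same candidate passages to a listwise judge k times in different random orders and keeping the passages selected in a majority of rounds; the abstract does not spell out the aggregation rule, so the majority vote is an assumption. The names `k_sampling_utility`, `keyword_judge`, and `first_position_judge` are hypothetical, and the toy judges stand in for an LLM call.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

# A listwise judge receives the question and an ordered list of passages and
# returns the positions (0-based, within that list) it considers useful.
# In practice this would wrap an LLM prompt; here it is a hypothetical stand-in.
Judge = Callable[[str, Sequence[str]], List[int]]


def k_sampling_utility(question: str, passages: Sequence[str], judge: Judge,
                       k: int = 5, seed: int = 0) -> List[int]:
    """Judge the passages k times in shuffled order and majority-vote the result.

    Assumption: a passage is kept if it is selected in more than half of the
    k rounds. Returns indices into the original `passages` sequence.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(k):
        order = list(range(len(passages)))
        rng.shuffle(order)  # a new random presentation order for this round
        shuffled = [passages[i] for i in order]
        # Map the judge's positions back to original indices; the set avoids
        # counting a passage twice within one round.
        for pos in set(judge(question, shuffled)):
            votes[order[pos]] += 1
    return sorted(i for i, count in votes.items() if count > k / 2)


def keyword_judge(question: str, shuffled: Sequence[str]) -> List[int]:
    """Toy order-insensitive judge: selects passages that mention 'France'."""
    return [pos for pos, text in enumerate(shuffled) if "France" in text]


def first_position_judge(question: str, shuffled: Sequence[str]) -> List[int]:
    """Toy judge with pure position bias: always selects the first passage."""
    return [0]


passages = [
    "Paris is the capital of France.",
    "France is a country in Europe.",
    "Berlin is the capital of Germany.",
]
question = "What is the capital of France?"
stable = k_sampling_utility(question, passages, keyword_judge, k=5)
biased = k_sampling_utility(question, passages, first_position_judge, k=5)
```

With the order-insensitive toy judge, every round selects the same passages, so the vote simply reproduces its choice. With the purely position-biased toy judge, votes are spread across whichever passage happens to come first, so at most one passage can reach a majority, which illustrates how shuffling plus voting dampens order effects.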
