Retrieval-augmented generation (RAG) is frequently used to mitigate hallucinations and provide up-to-date knowledge for large language models (LLMs). However, document retrieval is an imprecise task that sometimes surfaces erroneous or even harmful content in context, which raises the question of how LLMs handle retrieved information: if the provided content is incorrect, does the model know to ignore it, or does it recapitulate the error? Conversely, when the model's initial response is incorrect, does it always know to use the retrieved information to correct itself, or does it insist on its wrong prior response? To answer this, we curate a dataset of over 1,200 questions across six domains (e.g., drug dosages, Olympic records, locations), along with content relevant to answering each question. We further apply precise perturbations to the answers in the content, ranging from subtle to blatant errors. We benchmark six top-performing LLMs, including GPT-4o, on this dataset and find that LLMs are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time. However, the more unrealistic the retrieved content is (i.e., the further it deviates from the truth), the less likely the model is to adopt it. Likewise, the less confident a model is in its initial response (as measured by token probabilities), the more likely it is to adopt the information in the retrieved content. We exploit this finding and demonstrate simple methods for improving model accuracy when the retrieved content conflicts with the model's prior knowledge. Our results highlight a difficult task and benchmark for LLMs: correctly discerning when they are wrong in light of correct retrieved content and rejecting retrieved content when it is incorrect.
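As a rough illustration of the confidence-based heuristic mentioned above, the sketch below scores the model's prior answer by its average token probability and defers to the answer supported by the retrieved content only when that confidence is low. The helper names, the threshold value, and the log-probabilities in the usage example are assumptions for illustration, not the exact method or values used in the paper.

```python
import math

def mean_token_prob(token_logprobs):
    """Average per-token probability of an answer, used as a rough confidence score."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def resolve_conflict(prior_answer, prior_logprobs, retrieved_answer, threshold=0.8):
    """Keep the model's prior answer only when it was generated with high confidence;
    otherwise defer to the answer supported by the retrieved content.
    The threshold is a hypothetical value for illustration."""
    prior_conf = mean_token_prob(prior_logprobs)
    if prior_conf >= threshold:
        return prior_answer, prior_conf
    return retrieved_answer, prior_conf

# Illustrative usage with made-up token log-probabilities (not real model output).
prior_lp = [-0.05, -0.10, -0.02]          # fairly confident prior answer
answer, conf = resolve_conflict("400 mg", prior_lp, "4000 mg")
print(answer, round(conf, 3))             # keeps the prior answer at ~0.945 confidence
```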