Retrieval-augmented generation (RAG) is frequently used to mitigate hallucinations and provide up-to-date knowledge for large language models (LLMs). However, document retrieval is an imprecise task that sometimes surfaces erroneous or even harmful content in context, which raises the question of how LLMs handle retrieved information: if the provided content is incorrect, does the model know to ignore it, or does it recapitulate the error? Conversely, when the model's initial response is incorrect, does it always know to use the retrieved information to correct itself, or does it insist on its wrong prior response? To answer this, we curate a dataset of over 1,200 questions across six domains (e.g., drug dosages, Olympic records, locations), along with content relevant to answering each question. We further apply precise perturbations to the answers in the content, ranging from subtle to blatant errors. We benchmark six top-performing LLMs, including GPT-4o, on this dataset and find that LLMs are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time. However, the more unrealistic the retrieved content is (i.e., the further it deviates from the truth), the less likely the model is to adopt it. Likewise, the less confident a model is in its initial response (as measured by token probabilities), the more likely it is to adopt the information in the retrieved content. We exploit this finding and demonstrate simple methods for improving model accuracy when the retrieved content conflicts with the model's prior knowledge. Our results highlight a difficult task and benchmark for LLMs: correctly discerning when they are wrong in light of correct retrieved content and rejecting retrieved content when it is incorrect.
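As a rough illustration of the confidence-based heuristic mentioned above, the sketch below scores the model's prior answer by its average token probability and defers to the answer supported by the retrieved content only when that confidence is low. The helper names, the threshold value, and the log-probabilities in the usage example are assumptions for illustration, not the exact method or values used in the paper.

```python
import math

def mean_token_prob(token_logprobs):
    """Average per-token probability of an answer, used as a rough confidence score."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def resolve_conflict(prior_answer, prior_logprobs, retrieved_answer, threshold=0.8):
    """Keep the model's prior answer only when it was generated with high confidence;
    otherwise defer to the answer supported by the retrieved content.
    The threshold is a hypothetical value for illustration."""
    prior_conf = mean_token_prob(prior_logprobs)
    if prior_conf >= threshold:
        return prior_answer, prior_conf
    return retrieved_answer, prior_conf

# Illustrative usage with made-up token log-probabilities (not real model output).
prior_lp = [-0.05, -0.10, -0.02]          # fairly confident prior answer
answer, conf = resolve_conflict("400 mg", prior_lp, "4000 mg")
print(answer, round(conf, 3))             # keeps the prior answer at ~0.945 confidence
```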