Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are, and what factors influence their context-faithfulness, remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs' receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in their responses to different paraphrases of the same question, a factor overlooked in previous work. We also generate evidence in various styles to evaluate how presentation style affects receptiveness. Two datasets are used for evaluation: Natural Questions (NQ), which contains popular questions, and PopQA, which features long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly larger LLMs such as GPT-4. Conversely, presenting paraphrased evidence significantly increases LLMs' receptiveness compared with simple repetition or the addition of details.
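As a rough illustration of the divergence-based measurement described above, the sketch below scores memory strength as pairwise agreement among an LLM's answers to paraphrases of the same question. This is not the authors' implementation: the function name `memory_strength` and the normalized string-match agreement criterion are assumptions chosen for a minimal, self-contained example.

```python
from itertools import combinations

def memory_strength(answers: list[str]) -> float:
    """Hypothetical memory-strength score: the fraction of paraphrase pairs
    whose normalized answers agree. High agreement across paraphrases is
    read as strong, consistent internal memory; high divergence as weak memory.
    """
    norm = [a.strip().lower() for a in answers]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 1.0
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs)

# Example: answers an LLM might give to three paraphrases of one question.
responses = ["Paris", "paris", "Lyon"]
print(memory_strength(responses))  # 1 of 3 pairs agree -> ~0.33
```

In practice one would replace the exact string match with a semantic-equivalence check and obtain the answers by querying the model with several paraphrases of each question; the scalar score can then be used to bucket questions into high- and low-memory-strength groups.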