While auxiliary information has become key to enhancing Large Language Models (LLMs), relatively little is known about how LLMs merge these contexts, specifically contexts generated by LLMs themselves and those retrieved from external sources. To study this, we formulate a systematic framework to determine whether LLMs' responses, derived from the integration of generated and retrieved contexts, can be attributed to the generated or the retrieved context. To achieve this, we construct datasets with conflicting contexts, where each question is paired with both a generated and a retrieved context, yet only one of them contains the correct answer. Our experiments reveal a significant bias in LLMs (GPT-4/3.5 and Llama2) towards generated contexts, even when these contexts provide incorrect information. We further identify two key factors contributing to this bias: i) contexts generated by LLMs typically show greater similarity to the questions, increasing their likelihood of being selected; ii) the segmentation process applied to retrieved contexts disrupts their completeness, thereby hindering their full utilization by LLMs. Our analysis enhances the understanding of how LLMs merge diverse contexts, offering valuable insights for advancing current augmentation methods for LLMs.