LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

Recent studies have shown that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly a tendency to rate self-generated content more favorably. However, it remains unclear how far this bias extends to fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements. Our study addresses this gap by simulating two critical phases of the RAG framework. In the first phase, LLMs evaluate human-authored and model-generated passages, emulating the \textit{pointwise reranking phase}. In the second phase, we conduct pairwise reading comprehension tests to simulate the \textit{generation phase}. Contrary to previous findings of self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs' output, even in the absence of prior knowledge. These findings are consistent across three common QA datasets (NQ, MARCO, and TriviaQA) and five widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
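
As a rough illustration of how a self-preference effect could be quantified in the two simulated phases, the sketch below computes a score gap for pointwise ratings and a selection rate for pairwise choices. The abstract does not specify the paper's actual metrics, so the function names, the source labels ("self", "human"), and the rating scale are all hypothetical assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_source(ratings):
    """Average pointwise score per passage source.

    `ratings` is a list of (source, score) pairs; the source labels
    (e.g. "self", "human") are hypothetical and not taken from the paper.
    """
    by_source = defaultdict(list)
    for source, score in ratings:
        by_source[source].append(score)
    return {source: mean(scores) for source, scores in by_source.items()}


def self_preference_gap(ratings, self_label="self"):
    """Mean score of self-generated passages minus the mean of the other
    sources' mean scores. A gap near zero suggests no self-preference
    in the pointwise (reranking-style) setting."""
    means = mean_score_by_source(ratings)
    others = [m for source, m in means.items() if source != self_label]
    return means[self_label] - mean(others)


def pairwise_self_choice_rate(choices, self_label="self"):
    """Fraction of pairwise trials in which the model relied on its own
    passage. With two equally accurate candidates, a rate near 0.5 is
    what one would expect without self-preference."""
    return sum(1 for choice in choices if choice == self_label) / len(choices)


# Toy data, invented purely for illustration.
ratings = [("self", 4), ("self", 5), ("human", 4), ("human", 5)]
choices = ["self", "human", "human", "self"]

gap = self_preference_gap(ratings)
rate = pairwise_self_choice_rate(choices)
print(gap, rate)
```

In practice, one would pair such a gap with a significance test over many questions before drawing any conclusion; the toy data here only shows the shape of the computation.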

