Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search, by generating vast amounts of human-like text on the Internet. As a result, IR systems in the LLM era face a new challenge: the indexed documents are no longer written solely by human beings but are also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still largely unexplored question. In this work, we conduct a quantitative evaluation of IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of biases in neural retrievers towards LLM-generated content as \textbf{source bias}. Moreover, we discover that this bias is not confined to the first-stage neural retrievers but extends to the second-stage neural re-rankers. In-depth analyses from the perspective of text compression then indicate that LLM-generated texts exhibit more focused semantics with less noise, making it easier for neural retrieval models to perform semantic matching. To mitigate source bias, we also propose a plug-and-play debiased constraint on the optimization objective, and experimental results demonstrate its effectiveness. Finally, we discuss the potentially severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the two newly constructed benchmarks are available at https://github.com/KID-22/Source-Bias.