LLMs may Dominate Information Access: Neural Retrievers are Biased Towards LLM-Generated Texts

Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable ability to generate human-like text, LLMs have produced an enormous amount of text on the Internet. As a result, IR systems in the LLM era face a new challenge: the indexed documents are now not only written by human beings but also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of bias in neural retrieval models towards LLM-generated text as the \textbf{source bias}. Moreover, we discover that this bias is not confined to first-stage neural retrievers but extends to second-stage neural re-rankers. We then provide an in-depth analysis from the perspective of text compression and observe that neural models can better understand the semantic information of LLM-generated text, which is further substantiated by our theoretical analysis. We also discuss the potential severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the two new benchmarks we constructed and our code will later be made available at \url{https://github.com/KID-22/LLM4IR-Bias}.
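To make the idea of measuring source bias concrete, the following is a minimal sketch of one way such a quantitative comparison could be set up. It is an illustration under stated assumptions, not the evaluation protocol of this work: the function names (`reciprocal_rank`, `relative_source_gap`), the choice of reciprocal rank as the metric, and the toy rankings are all hypothetical. Each ranking is a list of source labels (`"human"` or `"llm"`) ordered from the top-ranked document down.

```python
from statistics import mean


def reciprocal_rank(ranking, source):
    """Return 1/rank of the first document from `source`, or 0.0 if absent."""
    for position, doc_source in enumerate(ranking, start=1):
        if doc_source == source:
            return 1.0 / position
    return 0.0


def relative_source_gap(rankings):
    """Relative difference (in percent) between the mean reciprocal rank of
    human-written and LLM-generated documents.

    Hypothetical measure for illustration: a negative value means that
    LLM-generated documents tend to be ranked higher.
    """
    human = mean(reciprocal_rank(r, "human") for r in rankings)
    llm = mean(reciprocal_rank(r, "llm") for r in rankings)
    midpoint = (human + llm) / 2
    if midpoint == 0:
        return 0.0  # neither source appears in any ranking
    return 100.0 * (human - llm) / midpoint


# Toy data (made up): one ranking per query, top-ranked document first.
toy_rankings = [
    ["llm", "human"],
    ["llm", "human"],
    ["human", "llm"],
]
gap = relative_source_gap(toy_rankings)
print(f"relative gap: {gap:.1f}%")
```

On this toy input the gap is negative, because the LLM-generated document is ranked first in two of the three queries. In a real study the rankings would come from a retriever scoring a corpus that mixes human-written and LLM-generated versions of the same content.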
