The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, shifting the corpora of Information Retrieval (IR) systems from solely human-written text to a mix of human-written and LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in the mixed-source data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora, spanning a range of text retrieval tasks and domains. Additionally, to avoid potential bias from dataset information that LLMs have already seen, we introduce an up-to-date dataset, NQ-UTD, with queries derived from recent events. By conducting over 1,000 experiments that assess state-of-the-art retrieval models on the Cocktail datasets, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the need for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era. All data and code are publicly available at \url{https://github.com/KID-22/Cocktail}.