The rapid proliferation of AI-generated content on the Web poses a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by large language models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process in which (1) AI-generated content comes to dominate search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyze this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, 67\% pool contamination led to over 80\% exposure contamination, creating a homogenized yet deceptively healthy state in which answer accuracy remains stable despite reliance on synthetic sources. Conversely, under adversarial contamination, baselines such as BM25 exposed $\sim$19\% of harmful content, whereas LLM-based rankers demonstrated stronger suppression. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence, and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.