Modern information retrieval (IR) is no longer consumed primarily by humans but, increasingly, by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise: misleading or irrelevant information is no longer just a nuisance but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising, that is, maximizing usable evidence density and verifiability within a context window, is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible, to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques spanning indexing, retrieval, context engineering, verification, and agentic workflows. We also present research on information denoising in domains that rely heavily on retrieval, such as lifelong assistants, coding agents, deep research, and multimodal understanding.