Model Editing for New Document Integration in Generative Information Retrieval

Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs). Despite its promise, existing GR models generalize poorly to newly added documents and often fail to generate the correct docIDs. While incremental training offers a straightforward remedy, it is computationally expensive, resource-intensive, and prone to catastrophic forgetting, which limits the scalability and practicality of GR. In this paper, we identify the core bottleneck as the decoder's ability to map hidden states to the correct docIDs of newly added documents. Model editing, which enables targeted parameter modifications for docID mapping, is a promising solution. However, applying model editing to current GR models is not trivial: it is severely hindered by edit vectors that are indistinguishable across queries, a consequence of the high overlap of shared docIDs in retrieval results. To address this, we propose DOME (docID-oriented model editing), a novel method that effectively and efficiently adapts GR models to unseen documents. DOME comprises three stages: (1) identification of critical layers, (2) optimization of edit vectors, and (3) construction and application of updates. At its core, DOME employs a hybrid-label adaptive training strategy that learns discriminative edit vectors by combining soft labels, which preserve query-specific semantics for distinguishable updates, with hard labels, which enforce precise mapping modifications. Experiments on widely used benchmarks, including NQ and MS MARCO, show that our method significantly improves retrieval performance on new documents while maintaining effectiveness on the original collection. Moreover, DOME achieves this with only about 60% of the training time required by incremental training, considerably reducing computational cost and enabling efficient, frequent model updates.
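To make the hybrid-label idea concrete, the following is a minimal sketch of one way soft and hard labels could be combined into a single objective over a docID token vocabulary. The abstract does not give the loss formula, so everything here is an assumption for illustration: the function name `hybrid_label_loss`, the fixed mixing weight `alpha`, and the use of cross-entropy for both terms are hypothetical, and the "adaptive" part of the strategy is not modeled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def hybrid_label_loss(logits, soft_target, hard_index, alpha=0.5):
    """Illustrative hybrid objective (an assumption, not the paper's formula).

    logits      -- decoder scores over docID tokens at one decoding step
    soft_target -- probability distribution over the same tokens
                   (e.g. the model's own query-specific prediction)
    hard_index  -- index of the correct docID token for the new document
    alpha       -- hypothetical fixed weight on the hard-label term
    """
    log_probs = log_softmax(logits)
    # Hard-label term: standard cross-entropy against the correct token.
    hard_loss = -log_probs[hard_index]
    # Soft-label term: cross-entropy against the soft distribution.
    soft_loss = -sum(t * lp for t, lp in zip(soft_target, log_probs))
    return alpha * hard_loss + (1.0 - alpha) * soft_loss


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0, 0.0]
    soft = [0.6, 0.2, 0.1, 0.1]
    print(hybrid_label_loss(logits, soft, hard_index=1, alpha=0.5))
```

In this sketch, setting `alpha` to 1 recovers plain cross-entropy on the correct docID token, while smaller values pull the prediction toward the query-specific soft distribution. How DOME actually weights or schedules the two terms is not stated in the abstract.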

