Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs). Despite its promise, existing GR models generalize poorly to newly added documents and often fail to generate the correct docIDs. Incremental training offers a straightforward remedy, but it is computationally expensive, resource-intensive, and prone to catastrophic forgetting, which limits the scalability and practicality of GR. In this paper, we identify the core bottleneck as the decoder's limited ability to map hidden states to the correct docIDs of newly added documents. Model editing, which enables targeted parameter modifications for docID mapping, is a promising solution. However, applying model editing to current GR models is non-trivial: it is severely hindered by edit vectors that are indistinguishable across queries, a consequence of the high overlap of shared docIDs in retrieval results. To address this, we propose DOME (docID-oriented model editing), a novel method that effectively and efficiently adapts GR models to unseen documents. DOME comprises three stages: (1) identification of critical layers, (2) optimization of edit vectors, and (3) construction and application of updates. At its core, DOME employs a hybrid-label adaptive training strategy that learns discriminative edit vectors by combining soft labels, which preserve query-specific semantics so that updates remain distinguishable, with hard labels, which enforce precise mapping modifications. Experiments on widely used benchmarks, including NQ and MS MARCO, show that our method significantly improves retrieval performance on new documents while maintaining effectiveness on the original collection. Moreover, DOME achieves this with only about 60% of the training time required by incremental training, considerably reducing computational cost and enabling efficient, frequent model updates.
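To make the hybrid-label idea concrete, the following is a minimal sketch of one way a soft-label term and a hard-label term could be combined into a single objective over docID-token logits. It is not the DOME objective itself: the abstract does not give the formulation, so the function name `hybrid_label_loss`, the fixed mixing weight `alpha`, and the use of cross-entropy for both terms are all assumptions made for illustration (the adaptive weighting mentioned in the abstract is not modeled here).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def hybrid_label_loss(logits, soft_target, hard_index, alpha=0.5):
    """Hypothetical hybrid-label objective (illustrative assumption).

    logits      : unnormalized scores over candidate docID tokens
    soft_target : probability distribution carrying query-specific semantics
    hard_index  : index of the correct docID token for the new document
    alpha       : assumed fixed weight on the hard-label term
    """
    log_probs = log_softmax(logits)
    # Hard-label term: cross-entropy against the one-hot correct docID token.
    hard_loss = -log_probs[hard_index]
    # Soft-label term: cross-entropy against the soft distribution.
    soft_loss = -sum(t * lp for t, lp in zip(soft_target, log_probs))
    return alpha * hard_loss + (1.0 - alpha) * soft_loss


if __name__ == "__main__":
    soft = [0.6, 0.2, 0.2]
    print(hybrid_label_loss([0.0, 0.0, 0.0], soft, hard_index=0))
    print(hybrid_label_loss([2.0, 0.0, 0.0], soft, hard_index=0))
```

In this sketch, raising the logit of the correct docID token lowers both terms when the soft distribution also favors that token, while the soft term keeps the loss sensitive to the rest of the distribution rather than to the single target alone.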