Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs). Despite its promise, existing GR models generalize poorly to newly added documents, often failing to generate the correct docIDs. While incremental training offers a straightforward remedy, it is computationally expensive, resource-intensive, and prone to catastrophic forgetting, limiting the scalability and practicality of GR. In this paper, we identify the core bottleneck as the decoder's inability to map hidden states to the correct docIDs of newly added documents. Model editing, which enables targeted parameter modifications for docID mapping, is a promising solution. However, applying model editing to current GR models is non-trivial: it is severely hindered by indistinguishable edit vectors across queries, caused by the high overlap of shared docIDs in retrieval results. To address this, we propose DOME (docID-oriented model editing), a novel method that effectively and efficiently adapts GR models to unseen documents. DOME comprises three stages: (1) identification of critical layers, (2) optimization of edit vectors, and (3) construction and application of updates. At its core, DOME employs a hybrid-label adaptive training strategy that learns discriminative edit vectors by combining soft labels, which preserve query-specific semantics for distinguishable updates, with hard labels, which enforce precise mapping modifications. Experiments on widely used benchmarks, including NQ and MS MARCO, show that our method significantly improves retrieval performance on new documents while maintaining effectiveness on the original collection. Moreover, DOME requires only about 60% of the training time of incremental training, considerably reducing computational cost and enabling efficient, frequent model updates.
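The hybrid-label objective described above can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the paper's implementation: it combines a soft-label term (cross-entropy against a query-specific distribution over docIDs, keeping edit vectors distinguishable across queries) with a hard-label term (negative log-likelihood of the correct new docID), weighted by a hypothetical mixing coefficient `alpha`.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the docID vocabulary
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_label_loss(logits, soft_targets, hard_target, alpha=0.5):
    """Toy hybrid-label loss (illustrative only; `alpha` is a hypothetical
    mixing weight, not a value from the paper).

    logits:       unnormalized scores over candidate docIDs
    soft_targets: query-specific target distribution (soft labels)
    hard_target:  index of the correct docID for the new document (hard label)
    """
    p = softmax(logits)
    # soft term: preserve query-specific semantics for distinguishable updates
    soft_loss = -np.sum(soft_targets * np.log(p + 1e-12))
    # hard term: enforce a precise mapping to the correct new docID
    hard_loss = -np.log(p[hard_target] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

With `alpha=0` the loss reduces to the hard-label term alone; with `alpha=1` it reduces to the soft-label term, so the coefficient trades off discriminability against mapping precision.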