Generative retrieval (GR) directly predicts the identifiers of relevant documents (i.e., docids) with a parametric model, and has achieved solid performance on many ad-hoc retrieval tasks. So far, these tasks have assumed a static document collection. In many practical scenarios, however, document collections are dynamic: new documents are continuously added to the corpus. The ability to incrementally index new documents while still answering queries with both previously and newly indexed relevant documents is vital for applying GR models. In this paper, we address this practical continual learning problem for GR. We put forward a novel Continual-LEarner for generatiVE Retrieval (CLEVER) model and make two major contributions to continual learning for GR: (i) to encode new documents into docids at low computational cost, we present Incremental Product Quantization, which updates a partial quantization codebook according to two adaptive thresholds; and (ii) to memorize new documents for querying without forgetting previous knowledge, we propose a memory-augmented learning mechanism that forms meaningful connections between old and new documents. Empirical results demonstrate the effectiveness and efficiency of the proposed model.
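To make the product-quantization idea concrete, the sketch below shows how a docid can be derived by splitting a document embedding into sub-vectors and mapping each to its nearest codeword, growing only the affected sub-codebook when a sub-vector falls outside known regions. This is a minimal illustration, not the paper's CLEVER implementation: the single distance threshold `tau` is a hypothetical stand-in for the paper's two adaptive thresholds, and the class and parameter names are invented for this example.

```python
import numpy as np

class IncrementalPQ:
    """Toy product quantizer: a docid is a tuple of codeword indices,
    one per subspace. New codewords are added incrementally."""

    def __init__(self, num_subspaces, tau):
        self.M = num_subspaces                        # number of sub-codebooks
        self.tau = tau                                # distance threshold for adding a codeword
        self.codebooks = [[] for _ in range(self.M)]  # one codeword list per subspace

    def encode(self, vec):
        """Map an embedding to a docid, updating only the sub-codebooks
        whose region of space the new document falls outside."""
        subs = np.split(np.asarray(vec, dtype=float), self.M)
        docid = []
        for m, sub in enumerate(subs):
            book = self.codebooks[m]
            if book:
                dists = [np.linalg.norm(sub - c) for c in book]
                k = int(np.argmin(dists))
                if dists[k] <= self.tau:              # close enough: reuse codeword
                    docid.append(k)
                    continue
            book.append(sub.copy())                   # unseen region: add a codeword
            docid.append(len(book) - 1)
        return tuple(docid)

pq = IncrementalPQ(num_subspaces=2, tau=0.5)
d1 = pq.encode([0.0, 0.0, 1.0, 1.0])   # first doc seeds both codebooks -> (0, 0)
d2 = pq.encode([0.1, 0.0, 1.0, 0.9])   # near d1 in both subspaces -> (0, 0)
d3 = pq.encode([5.0, 5.0, 1.0, 1.0])   # far only in subspace 0 -> (1, 0)
```

Note how `d3` triggers an update to just one sub-codebook: this partial-update property is what makes product-quantization docids attractive for incrementally indexing a dynamic corpus.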