Generative retrieval introduces a new paradigm of document retrieval in which the model directly generates the identifier of a document relevant to a query. Although this approach avoids constructing auxiliary index structures, existing studies face two significant challenges: (i) a discrepancy between the knowledge of pre-trained language models and the identifiers, and (ii) a gap between training and inference that makes learning to rank difficult. To overcome these challenges, we propose a novel generative retrieval method, Generative retrieval via LExical iNdex learning (GLEN). For training, GLEN exploits a dynamic lexical identifier with a two-phase index learning strategy, which lets it learn meaningful lexical identifiers and relevance signals between queries and documents. For inference, GLEN uses collision-free inference, ranking documents by their identifier weights without additional overhead. Experimental results show that GLEN achieves state-of-the-art or competitive performance against existing generative retrieval methods on various benchmark datasets, e.g., NQ320k, MS MARCO, and BEIR. The code is available at https://github.com/skleee/GLEN.
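To make the idea of ranking by identifier weights concrete, the following is a minimal sketch, not the GLEN implementation. It assumes that each document stores one weight per identifier token and that the model yields one score per token for the query. The function name `rank_with_identifier_weights`, the dot-product scoring, and the toy data are all illustrative assumptions. The point it shows is that two documents sharing the same identifier tokens can still be ordered, because their weights differ.

```python
def rank_with_identifier_weights(query_token_scores, doc_identifiers):
    """Rank documents by a weighted match between query and identifier tokens.

    query_token_scores: dict mapping token -> score for the query (assumed
        to come from the generative model; hypothetical here).
    doc_identifiers: dict mapping doc_id -> list of (token, weight) pairs
        (assumed to be the document's lexical identifier and its weights).
    Returns a list of (doc_id, score), best first.
    """
    scored = []
    for doc_id, identifier in doc_identifiers.items():
        # Dot product over the identifier tokens; tokens the query does not
        # score contribute zero.
        score = sum(query_token_scores.get(token, 0.0) * weight
                    for token, weight in identifier)
        scored.append((doc_id, score))
    # Sort by descending score; break exact ties by doc_id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Toy data: d1 and d2 collide on the same identifier tokens but carry
# different weights, so they still receive distinct scores.
query_scores = {"olympic": 2.0, "games": 1.0}
documents = {
    "d1": [("olympic", 0.9), ("games", 0.4)],
    "d2": [("olympic", 0.5), ("games", 0.8)],
    "d3": [("paris", 0.7), ("tower", 0.6)],
}

for doc_id, score in rank_with_identifier_weights(query_scores, documents):
    print(doc_id, round(score, 3))
```

Because the score reuses weights that already exist for each identifier, this kind of tie-breaking needs no extra index or second-stage model, which is one plausible reading of "without additional overhead".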