Generative retrieval is a promising new paradigm in text retrieval that generates identifier strings of relevant passages as the retrieval target. Unlike traditional sparse or dense retrieval methods, it leverages powerful generative language models. In this work, we identify distillation as a viable direction for further enhancing generative retrieval and propose a framework named DGR. DGR uses a sophisticated ranking model, such as a cross-encoder, as a teacher that supplies a ranked list of passages, which captures varying degrees of relevance rather than binary hard labels. DGR then optimizes the generative retrieval model with a specially designed distilled RankNet loss, taking the passage rank order provided by the teacher as labels. The framework requires only an additional distillation step to enhance existing generative retrieval systems and adds no burden at inference time. We conduct experiments on four public datasets, and the results indicate that DGR achieves state-of-the-art performance among generative retrieval methods. DGR also remains robust and generalizes well across different teacher models and distillation losses.
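To make the idea of using a teacher's rank order as labels concrete, the following is a minimal sketch of a standard pairwise RankNet-style loss. It is not the distilled RankNet loss of DGR, whose exact form the abstract does not give. The function name `pairwise_ranknet_loss` and the interpretation of `student_scores` (for example, the log-likelihood the generative model assigns to each passage identifier) are assumptions made for illustration only.

```python
import math
from typing import Sequence


def pairwise_ranknet_loss(
    student_scores: Sequence[float],
    teacher_ranking: Sequence[int],
) -> float:
    """Plain RankNet-style pairwise loss (illustrative, not the DGR loss).

    student_scores[k] is the student's score for passage k (assumed here to be
    something like the log-likelihood of the passage identifier).
    teacher_ranking lists passage indices from most to least relevant,
    as a teacher such as a cross-encoder might order them.
    """
    total = 0.0
    pairs = 0
    for pos, better in enumerate(teacher_ranking):
        for worse in teacher_ranking[pos + 1:]:
            # The teacher says `better` should outscore `worse`.
            diff = student_scores[worse] - student_scores[better]
            # Numerically stable softplus: log(1 + exp(diff)).
            total += max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
            pairs += 1
    # Average over all ordered pairs; a single passage yields no pairs.
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    ranking = [0, 1, 2]  # teacher: passage 0 is best, passage 2 is worst
    print(pairwise_ranknet_loss([3.0, 2.0, 1.0], ranking))  # student agrees
    print(pairwise_ranknet_loss([1.0, 2.0, 3.0], ranking))  # student disagrees
```

The loss is small when the student's scores follow the teacher's order and grows when they contradict it, which is the sense in which the rank list acts as a graded label rather than a binary one. Because such a loss is applied only during training, it is consistent with the claim that distillation adds no cost at inference.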