Generative retrieval is a promising paradigm in text retrieval that generates identifier strings of relevant passages as the retrieval target. It leverages powerful generation models and is distinct from traditional learning-to-rank methods. However, despite rapid development, current generative retrieval methods remain limited. They typically rely on a heuristic function to transform predicted identifiers into a passage rank list, which creates a gap between the learning objective of generative retrieval and the desired passage ranking target. Moreover, the exposure bias problem inherent to text generation also persists in generative retrieval. To address these issues, we propose a novel framework, called LTRGR, that combines generative retrieval with the classical learning-to-rank paradigm. Our approach trains an autoregressive model with a passage rank loss, which directly optimizes the model toward the optimal passage ranking. The framework requires only an additional training step to enhance current generative retrieval systems and adds no burden to the inference stage. We conducted experiments on three public datasets, and the results show that LTRGR achieves state-of-the-art performance among generative retrieval methods, indicating its effectiveness and robustness.
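To make the gap between identifier generation and passage ranking concrete, the following is a minimal sketch, not the LTRGR implementation. The abstract specifies neither the heuristic function nor the exact form of the passage rank loss, so two assumptions are made here: the heuristic sums the model scores of the predicted identifiers that belong to a passage, and the rank loss is a pairwise margin (hinge) loss. All function names and the toy data are hypothetical.

```python
from typing import Dict, List


def passage_score(predicted: Dict[str, float], passage_ids: List[str]) -> float:
    """Assumed heuristic: sum the scores of predicted identifiers found in the passage."""
    return sum(predicted.get(identifier, 0.0) for identifier in passage_ids)


def rank_passages(predicted: Dict[str, float],
                  passages: Dict[str, List[str]]) -> List[str]:
    """Turn predicted identifiers into a passage rank list (highest score first)."""
    scores = {pid: passage_score(predicted, ids) for pid, ids in passages.items()}
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


def margin_rank_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Assumed pairwise rank loss: zero once the relevant passage leads by `margin`."""
    return max(0.0, neg_score - pos_score + margin)


# Toy data (hypothetical): identifier strings the model generated, with scores.
predicted = {"title: Paris": 2.0, "capital of France": 1.5, "Eiffel": 0.5}
# Identifiers associated with each candidate passage.
passages = {"p1": ["title: Paris", "capital of France"], "p2": ["Eiffel"]}

ranking = rank_passages(predicted, passages)
loss_when_p1_relevant = margin_rank_loss(
    passage_score(predicted, passages["p1"]),
    passage_score(predicted, passages["p2"]),
)
loss_when_p2_relevant = margin_rank_loss(
    passage_score(predicted, passages["p2"]),
    passage_score(predicted, passages["p1"]),
)
```

In this sketch the loss is defined on passage scores rather than on identifier tokens, so in a differentiable setting its gradient would push the autoregressive model toward the desired passage order. Because it is only a training signal, inference is unchanged: identifiers are generated and aggregated exactly as before.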