Generative retrieval, a new paradigm for document retrieval, has recently attracted research interest because it encodes all documents into the model and directly generates the retrieved documents. However, its potential remains underexploited: it relies heavily on "preprocessed" document identifiers (docids), which limits both its retrieval performance and its ability to retrieve new documents. In this paper, we propose a novel, fully end-to-end retrieval paradigm. It not only learns the best docids for existing and new documents automatically, end to end, via a semantic indexing module, but also performs end-to-end document retrieval via an encoder-decoder-based generative model, namely Auto Search Indexer (ASI). In addition, we design a reparameterization mechanism that combines these two modules into a joint optimization framework. Extensive experiments on both public and industrial datasets demonstrate that our model outperforms advanced baselines and verify its ability to handle new documents.
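The abstract does not specify how the reparameterization mechanism works. As a rough illustration of how a discrete docid choice can stay trainable alongside a downstream generative model, the sketch below uses a straight-through estimator. This is one common technique and an assumption here, not necessarily the authors' mechanism. The function names (`softmax`, `straight_through_forward`, `straight_through_backward`, `assign_docid`) and the idea of one logit vector per docid position are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through_forward(logits):
    """Forward pass for one docid position (assumed setup).

    Returns the hard one-hot choice that a downstream model would consume,
    together with the soft probabilities kept for the backward pass.
    """
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    one_hot = [1.0 if i == idx else 0.0 for i in range(len(probs))]
    return one_hot, probs


def straight_through_backward(probs, grad_output):
    """Backward pass: treat the hard one-hot as if it had been `probs`.

    Computes the softmax vector-Jacobian product p_j * (g_j - <p, g>),
    so gradients from the retrieval loss can reach the indexing logits.
    """
    dot = sum(p * g for p, g in zip(probs, grad_output))
    return [p * (g - dot) for p, g in zip(probs, grad_output)]


def assign_docid(position_logits):
    """Pick one token per position to form a docid (list of token indices)."""
    return [max(range(len(row)), key=row.__getitem__) for row in position_logits]
```

In this toy setup, a hypothetical indexing module would emit one logit vector per docid position. `assign_docid` reads off the discrete identifier, and the forward/backward pair shows how a loss computed on the hard choice could still update the logits that produced it.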