Auto-regressive search engines are emerging as a promising paradigm for next-generation information retrieval systems. These methods rely on Seq2Seq models that map each query directly to the identifier of its relevant document, and they are valued for merits such as being end-to-end differentiable. However, auto-regressive search engines also face challenges in retrieval quality, because the document identifier must be generated exactly: the target document is missed from the retrieval results if a false prediction about its identifier is made at any step of the generation process. In this work, we propose a novel framework, AutoTSG (Auto-regressive Search Engine with Term-Set Generation), which features 1) an unordered term-based document identifier and 2) a set-oriented generation pipeline. With AutoTSG, any permutation of the term-set identifier leads to the retrieval of the corresponding document, which largely relaxes the requirement of exact generation. In addition, the Seq2Seq model can flexibly explore the optimal permutation of the document identifier for the presented query, which may further improve retrieval quality. AutoTSG is empirically evaluated on Natural Questions and MS MARCO, where it achieves notable improvements over existing auto-regressive search engines.
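The permutation-invariance property of a term-set identifier can be illustrated with a minimal sketch. This is not the AutoTSG implementation: the toy corpus, the names `DOC_TERMS`, `INDEX`, `lookup`, and `valid_next_terms`, and the subset-based rule for choosing the next term are all assumptions made for illustration, and no Seq2Seq model is involved.

```python
from itertools import permutations

# Hypothetical toy corpus: document id -> unordered set of identifier terms.
DOC_TERMS = {
    "d1": {"retrieval", "generation", "identifier"},
    "d2": {"marco", "passage", "ranking"},
}

# Keying the index by frozenset makes term order irrelevant at lookup time.
INDEX = {frozenset(terms): doc_id for doc_id, terms in DOC_TERMS.items()}


def lookup(generated_terms):
    """Return the document whose term set equals the generated terms, else None."""
    return INDEX.get(frozenset(generated_terms))


def valid_next_terms(prefix):
    """Assumed constraint: terms that keep `prefix` a subset of some identifier."""
    chosen = set(prefix)
    allowed = set()
    for terms in DOC_TERMS.values():
        if chosen <= terms:
            allowed |= terms - chosen
    return allowed


if __name__ == "__main__":
    # Every ordering of the same three terms resolves to the same document.
    for order in permutations(["retrieval", "generation", "identifier"]):
        print(order, "->", lookup(order))
    # After one term is chosen, only terms from compatible documents remain.
    print(sorted(valid_next_terms(["marco"])))
```

Under a sequence-style identifier, only one of the six orderings above would match. With the set-based key, all six do, which is the relaxation of exact generation described in the abstract.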