Model-based retrieval has recently emerged as a new paradigm in text retrieval that discards the index used by traditional retrieval models and instead memorizes the candidate corpus in the model parameters. This design uses a sequence-to-sequence paradigm to generate document identifiers, which allows the model to fully capture the relevance between queries and documents and simplifies the classic index-retrieval-rerank pipeline. Despite these attractive qualities, model-based retrieval still faces several major challenges, including the discrepancy between pre-training and fine-tuning and the discrepancy between training and inference. To address these challenges, we propose a novel two-stage model-based retrieval approach called TOME, which makes two major technical contributions: the use of tokenized URLs as document identifiers and the design of a two-stage generation architecture. We also propose a number of training strategies to cope with the increasing training difficulty as the corpus size grows. Extensive experiments and analysis on MS MARCO and Natural Questions demonstrate the effectiveness of the proposed approach, and we investigate the scaling laws of TOME by examining various influencing factors.