Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval reframes the classic information retrieval problem as a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on corpora on the order of 100k documents. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages: notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.