Generative retrieval has recently attracted considerable attention from the research community for its simplicity, high performance, and ability to fully leverage the power of deep autoregressive models. However, prior work on generative retrieval has mostly been evaluated on static benchmarks, while realistic retrieval applications often involve dynamic environments where knowledge is temporal and accumulates over time. In this paper, we introduce STREAMINGIR, a new benchmark derived from StreamingQA that simulates realistic retrieval use cases and is dedicated to quantifying how well retrieval methods generalize to dynamically changing corpora. On this benchmark, we conduct an in-depth comparative evaluation of bi-encoder and generative retrieval in terms of both performance and efficiency under varying degrees of supervision. Our results suggest that generative retrieval shows (1) poor performance when only supervised data is used for fine-tuning, (2) superior performance over bi-encoders when only unsupervised data is available, and (3) lower performance than bi-encoders when both unsupervised and supervised data are used, due to catastrophic forgetting. Nevertheless, we show that parameter-efficient measures can effectively mitigate this issue and yield performance and efficiency competitive with the bi-encoder baseline. Our results open up new potential for generative retrieval in practical dynamic environments. Our work will be open-sourced.