Generative retrieval seeks to replace traditional search index data structures with a single large-scale neural network, offering the potential for improved efficiency and seamless integration with generative large language models. As an end-to-end paradigm, generative retrieval adopts a learned differentiable search index and conducts retrieval by directly generating document identifiers through corpus-specific constrained decoding. The generalization capabilities of generative retrieval on out-of-distribution corpora have garnered significant attention. In this paper, we examine the inherent limitations of constrained auto-regressive generation from two essential perspectives: constraints and beam search. We begin with the Bayes-optimal setting, in which the generative retrieval model exactly captures the underlying relevance distribution over all possible documents, and then apply the model to specific corpora by simply adding corpus-specific constraints. Our main findings are two-fold: (i) For the effect of constraints, we derive a lower bound on the error in terms of the KL divergence between the ground-truth and the model-predicted step-wise marginal distributions. (ii) For the beam search algorithm used during generation, we show that relying on marginal distributions may not be an ideal approach. This paper aims to improve our theoretical understanding of the generalization capabilities of the auto-regressive decoding retrieval paradigm, characterizing its limitations and inspiring future advancements toward more robust and generalizable generative retrieval.
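As a reading aid for finding (i): the abstract does not state the bound itself, so the block below only fixes plausible notation for the quantities it names; the symbols $P_t$, $Q_t$, and $\mathcal{V}$ are assumptions for illustration, not the paper's own.

```latex
% Notation sketch for finding (i); all symbols are illustrative assumptions.
% Let d = (d_1, ..., d_T) be a document identifier and q a query.
% Step-wise marginal of the t-th identifier token:
%   ground truth       P_t(v) = Pr[d_t = v | q]
%   constrained model  Q_t(v) = Pr_model[d_t = v | q, corpus constraints]
% Finding (i) lower-bounds the retrieval error in terms of the KL divergence
\[
  D_{\mathrm{KL}}\!\left(P_t \,\Vert\, Q_t\right)
  \;=\; \sum_{v \in \mathcal{V}} P_t(v)\,\log \frac{P_t(v)}{Q_t(v)},
\]
% taken over the identifier-token vocabulary \mathcal{V}.
```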
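To make the decoding mechanism concrete, here is a minimal, self-contained sketch of corpus-specific constrained beam search over document identifiers, the loop the abstract describes. The prefix trie, the toy next-token table, and all names (`PrefixTrie`, `constrained_beam_search`, `toy_model`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of corpus-specific constrained beam search over docids.
# Everything here (PrefixTrie, toy_model, the tiny corpus) is an illustrative
# assumption, not the paper's implementation.
from math import exp, log

class PrefixTrie:
    """Trie over the docids of one corpus; decoding may only follow its edges."""
    def __init__(self, docids):
        self.root = {}
        for docid in docids:
            node = self.root
            for token in docid:
                node = node.setdefault(token, {})
            node["<eos>"] = {}  # marks the end of a valid identifier

    def allowed(self, prefix):
        """Tokens that keep the prefix inside the corpus-specific constraint set."""
        node = self.root
        for token in prefix:
            node = node.get(token, {})
        return set(node.keys())

def toy_model(prefix):
    """Stand-in for the model's p(token | query, prefix): a fixed toy table."""
    table = {
        (): {"a": 0.5, "b": 0.4, "c": 0.1},
        ("a",): {"a": 0.3, "b": 0.3, "c": 0.4},
        ("b",): {"a": 0.7, "b": 0.2, "c": 0.1},
    }
    return table.get(tuple(prefix), {})

def constrained_beam_search(trie, beam_size=2, max_len=2):
    beams = [([], 0.0)]          # (identifier prefix, log-probability)
    finished = []
    for _ in range(max_len + 1):
        candidates = []
        for prefix, score in beams:
            for token in trie.allowed(prefix):
                if token == "<eos>":
                    finished.append((prefix, score))
                    continue
                p = toy_model(prefix).get(token, 0.0)
                if p > 0.0:
                    candidates.append((prefix + [token], score + log(p)))
        # Prune: keep only the top-`beam_size` partial identifiers.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

corpus = ["aa", "ac", "ba"]      # docids of the target corpus
trie = PrefixTrie([list(d) for d in corpus])
for docid, logp in constrained_beam_search(trie):
    print("".join(docid), f"p={exp(logp):.2f}")
```

Run with beam_size=1, the sketch keeps only the prefix "a" (step-wise score 0.5) and returns "ac" (joint probability 0.20), even though the constrained argmax is "ba" (0.28). This is the generic failure mode of pruning by step-wise scores; the paper's finding (ii) analyzes a related issue for marginal distributions in the Bayes-optimal constrained setting.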