Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR's local normalization. Second, with larger models, GR's representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. However, despite these theoretical advantages, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR's theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
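To make the contrast between the two objectives concrete, the following minimal sketch (not the paper's implementation; all tensor shapes, function names, and toy sizes are illustrative assumptions) shows a locally normalized DR loss with in-batch negatives and bilinear (dot-product) scoring, versus a GR loss that maximizes the likelihood of a document-identifier sequence with a softmax over the full identifier vocabulary at each decoding step.

```python
# Minimal sketch contrasting DR's locally normalized contrastive objective
# with GR's globally normalized sequence likelihood. Assumed PyTorch code;
# shapes and toy sizes are illustrative, not the paper's setup.
import torch
import torch.nn.functional as F

def dr_loss(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # Dense retrieval: queries and documents are embedded independently,
    # relevance is a bilinear (dot-product) interaction, and the softmax
    # ranges only over the documents sampled into the batch (in-batch
    # negatives) -- the "local normalization" discussed above.
    # q_emb: [B, dim] query embeddings, d_emb: [B, dim] positive doc embeddings
    scores = q_emb @ d_emb.T                  # [B, B] bilinear similarities
    labels = torch.arange(q_emb.size(0))      # positive document on the diagonal
    return F.cross_entropy(scores, labels)    # normalized over sampled docs only

def gr_loss(logits: torch.Tensor, docid_tokens: torch.Tensor) -> torch.Tensor:
    # Generative retrieval: the decoder autoregressively emits a document
    # identifier; at every step the softmax covers the full identifier
    # vocabulary, so the objective is globally normalized and corpus
    # knowledge is stored in the model parameters, not an external index.
    # logits: [B, L, V] decoder outputs, docid_tokens: [B, L] target docid ids
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           docid_tokens.reshape(-1))

# Toy usage with random tensors, just to show the shapes involved.
B, dim, L, V = 4, 128, 8, 10_000
print(dr_loss(torch.randn(B, dim), torch.randn(B, dim)))
print(gr_loss(torch.randn(B, L, V), torch.randint(0, V, (B, L))))
```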


