Generative retrieval, a promising new paradigm in information retrieval, employs a seq2seq model to encode document features into parameters and decode relevant document identifiers (IDs) based on search queries. Existing generative retrieval solutions typically rely on a preprocessing stage to pre-define document IDs, which can suffer from a semantic gap between these IDs and the retrieval task. However, end-to-end training for both ID assignments and retrieval tasks is challenging due to the long-tailed distribution characteristics of real-world data, resulting in inefficient and unbalanced ID space utilization. To address these issues, we propose ASI++, a novel fully end-to-end generative retrieval method that aims to simultaneously learn balanced ID assignments and improve retrieval performance. ASI++ builds on the fully end-to-end training framework of vanilla ASI and introduces several key innovations. First, a distributionally balanced criterion addresses the imbalance in ID assignments, promoting more efficient utilization of the ID space. Next, a representation bottleneck criterion enhances dense representations to alleviate bottlenecks in learning ID assignments. Finally, an information consistency criterion integrates these processes into a joint optimization framework grounded in information theory. We further explore various module structures for learning ID assignments, including neural quantization, differentiable product quantization, and residual quantization. Extensive experiments on both public and industrial datasets demonstrate the effectiveness of ASI++ in improving retrieval performance and achieving balanced ID assignments.
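To make the idea of learned, balanced ID assignments more concrete, the following is a minimal sketch of residual quantization, one of the module structures mentioned above, together with a simple balance measure. It is not the ASI++ implementation: the codebooks are hand-picked toy values rather than learned, and `residual_quantize`, `nearest`, and `normalized_entropy` are hypothetical helper names. Using the normalized entropy of code usage as a proxy for balance is an assumption made here for illustration and is not claimed to be the paper's distributionally balanced criterion.

```python
import math
from collections import Counter


def nearest(vec, codebook):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def residual_quantize(vec, codebooks):
    """Return one code index per level; each level quantizes the previous residual."""
    ids, residual = [], list(vec)
    for codebook in codebooks:
        k = nearest(residual, codebook)
        ids.append(k)
        # Subtract the chosen codeword so the next level refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(ids)


def normalized_entropy(codes, num_codes):
    """Entropy of code usage divided by log(num_codes); 1.0 means uniform usage."""
    counts = Counter(codes)
    total = len(codes)
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(num_codes)


# Toy, hand-picked codebooks (hypothetical; a real system would learn them).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],     # level 1: coarse partition
    [[0.0, 0.0], [0.25, -0.25]],  # level 2: refines the level-1 residual
]

# Toy dense document representations (hypothetical).
docs = [[0.1, 0.0], [0.9, 1.1], [1.2, 0.8], [0.3, -0.2]]

doc_ids = [residual_quantize(d, codebooks) for d in docs]
balance = normalized_entropy([ids[0] for ids in doc_ids], num_codes=2)
```

In this toy setup each document receives a distinct two-level ID and the first-level codes are used evenly. With long-tailed data, a skewed assignment such as `[0, 0, 0, 1]` would score lower under the same measure, which is the kind of unbalanced ID space utilization the abstract describes.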