Recently, a new paradigm called the Differentiable Search Index (DSI) has been proposed for document retrieval, wherein a sequence-to-sequence model is trained to directly map queries to relevant document identifiers. The key idea behind DSI is to fully parameterize the traditional "index-retrieve" pipeline within a single neural model, by encoding all documents in the corpus into the model parameters. In essence, DSI needs to resolve two major questions: (1) how to assign an identifier to each document, and (2) how to learn the associations between a document and its identifier. In this work, we propose a Semantic-Enhanced DSI model (SE-DSI) motivated by Learning Strategies in the area of Cognitive Psychology. Our approach advances the original DSI in two ways: (1) For the document identifier, we take inspiration from Elaboration Strategies in human learning. Specifically, we assign each document an Elaborative Description based on the query generation technique, which is more meaningful than the string of integers used in the original DSI; and (2) For the associations between a document and its identifier, we take inspiration from Rehearsal Strategies in human learning. Specifically, we select fine-grained semantic features from a document as Rehearsal Contents to improve document memorization. Both offline and online experiments show improved retrieval performance over prevailing baselines.
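To make the two design choices concrete, the following is a minimal sketch of how training pairs for such a model might be assembled. It is not the authors' implementation: `elaborative_description` is a hypothetical stand-in for a learned query generator (here it simply reuses the first sentence), and `rehearsal_contents` uses a naive term-frequency score as an assumed proxy for "fine-grained semantic features". All function names and the toy corpus are illustrative assumptions.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter (assumption: ., ! or ? ends a sentence)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def elaborative_description(doc_text):
    """Hypothetical stand-in for a learned query generator.

    A real system would generate a query-like description with a
    trained model; this placeholder returns the first sentence, lowercased.
    """
    return split_sentences(doc_text)[0].lower()


def rehearsal_contents(doc_text, k=2):
    """Select k sentences by summed term frequency (assumed heuristic)."""
    sentences = split_sentences(doc_text)
    tf = Counter(re.findall(r"\w+", doc_text.lower()))

    def score(sentence):
        return sum(tf[w] for w in re.findall(r"\w+", sentence.lower()))

    # sorted() is stable, so ties keep their original document order.
    return sorted(sentences, key=score, reverse=True)[:k]


def build_training_pairs(corpus, labeled_queries):
    """Build (input, target) text pairs for a seq2seq model.

    corpus: dict mapping a document key to its text.
    labeled_queries: list of (query, document key) pairs.
    """
    ed = {key: elaborative_description(text) for key, text in corpus.items()}
    # Indexing pairs: rehearsal content -> elaborative description.
    indexing = [
        (content, ed[key])
        for key, text in corpus.items()
        for content in rehearsal_contents(text)
    ]
    # Retrieval pairs: user query -> elaborative description.
    retrieval = [(query, ed[key]) for query, key in labeled_queries]
    return indexing, retrieval
```

Both kinds of pairs share the same target string, so a single sequence-to-sequence model can be trained to emit a document's description either from its own content (memorization) or from a query (retrieval).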