The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures, where indexing and retrieval are two separate components, DSI uses a single transformer model to perform both. In this paper, we identify and tackle an important issue with current DSI models: the data distribution mismatch between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing time, current DSI methods learn to connect the text of long documents to their identifiers, whereas at retrieval time document identifiers must be predicted from queries that are typically much shorter than the indexed documents. This problem is further exacerbated when DSI is used for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents each document with a number of potentially relevant queries that are generated by a query generation model and then re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing time allows the DSI model to connect a document identifier to a set of queries, thereby mitigating the data distribution mismatch between the indexing and retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
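The indexing step described in the abstract (generate candidate queries per document, re-rank them, keep the best ones, and pair each with the document identifier) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_indexing_pairs`, `generate_queries`, `score`, and `top_k` are hypothetical names, and the two toy functions are simple stand-ins for a real query generation model and a real cross-encoder ranker.

```python
from typing import Callable, Dict, List, Tuple


def build_indexing_pairs(
    docs: Dict[str, str],
    generate_queries: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, str]]:
    """Return (query, docid) training pairs for the indexing phase.

    generate_queries: hypothetical stand-in for a query generation model.
    score: hypothetical stand-in for a cross-encoder relevance scorer.
    top_k: assumed number of queries kept per document after filtering.
    """
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        candidates = generate_queries(text)
        # Re-rank candidate queries by relevance to the document, best first.
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        # Filter: keep only the top_k queries to represent the document.
        pairs.extend((query, docid) for query in ranked[:top_k])
    return pairs


# Toy stand-ins, for illustration only (assumptions, not real models).
def toy_generate_queries(text: str) -> List[str]:
    """Emit adjacent word pairs as 'queries', plus one off-topic candidate."""
    words = text.lower().split()
    bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return bigrams + ["unrelated noise"]


def toy_score(query: str, text: str) -> float:
    """Fraction of query tokens that also occur in the document."""
    doc_tokens = set(text.lower().split())
    query_tokens = query.lower().split()
    return sum(token in doc_tokens for token in query_tokens) / len(query_tokens)


if __name__ == "__main__":
    docs = {
        "d1": "neural ranking models",
        "d2": "sparse inverted index",
    }
    for query, docid in build_indexing_pairs(
        docs, toy_generate_queries, toy_score, top_k=2
    ):
        print(f"{query!r} -> {docid}")
```

The resulting pairs would replace the (document text, identifier) pairs as training input, so that the model sees short, query-like text at indexing time as well as at retrieval time. In a cross-lingual setting, the generator would be expected to produce queries in the query language, which is not modeled in this toy version.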