Dense embedding-based retrieval is now the industry standard for semantic search and ranking problems, such as retrieving relevant web documents for a given query. Such techniques use a two-stage process: (a) contrastive learning to train a dual encoder that embeds both queries and documents, and (b) approximate nearest neighbor search (ANNS) to find similar documents for a given query. These two stages are disjoint; the learned embeddings might be ill-suited for the ANNS method and vice versa, leading to suboptimal performance. In this work, we propose End-to-end Hierarchical Indexing (EHI), which jointly learns both the embeddings and the ANNS structure to optimize retrieval performance. EHI uses a standard dual encoder model to embed queries and documents while learning an inverted file index (IVF) style tree structure for efficient ANNS. To ensure stable and efficient learning of the discrete tree-based ANNS structure, EHI introduces the notion of a dense path embedding that captures the position of a query/document in the tree. We demonstrate the effectiveness of EHI on several benchmarks, including the de facto industry-standard MS MARCO (Dev set and TREC DL19) datasets. For example, with the same compute budget, EHI outperforms the state of the art (SOTA) by 0.6% (MRR@10) on the MS MARCO dev set and by 4.2% (nDCG@10) on the TREC DL19 benchmark.
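The disjoint two-stage baseline described above can be illustrated with a minimal sketch. This is not EHI itself: random unit vectors stand in for dual-encoder outputs, plain k-means is assumed as the clustering step, and the helper names `build_ivf` and `search_ivf` are hypothetical. The point is that the index is built after the embeddings are fixed, so the encoder never learns how the documents will be partitioned.

```python
import numpy as np


def build_ivf(doc_emb, n_clusters, n_iters=10, seed=0):
    """Cluster fixed document embeddings with plain k-means.

    Returns the centroids and one inverted list (array of doc ids) per cluster.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen documents.
    init_ids = rng.choice(len(doc_emb), n_clusters, replace=False)
    centroids = doc_emb[init_ids].copy()
    for _ in range(n_iters):
        # Squared L2 distance from every document to every centroid.
        dists = ((doc_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = doc_emb[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((doc_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    inverted_lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, inverted_lists


def search_ivf(query, doc_emb, centroids, inverted_lists, k=5, n_probe=2):
    """Score only documents in the n_probe clusters nearest to the query."""
    centroid_dists = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(centroid_dists)[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = doc_emb[candidates] @ query  # inner-product relevance score
    top = np.argsort(-scores)[:k]
    return candidates[top]


# Stand-in for stage (a): pretend these came from a trained dual encoder.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42]

# Stage (b): build the index on the frozen embeddings, then search it.
centroids, inverted_lists = build_ivf(docs, n_clusters=8)
print(search_ivf(query, docs, centroids, inverted_lists, k=5, n_probe=2))
```

Raising `n_probe` trades speed for recall; probing every cluster reduces to exact brute-force search. In this sketch the clustering step sees only frozen embeddings, which is the mismatch between the two stages that the abstract identifies.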