Dense embedding-based retrieval is widely used for semantic search and ranking. However, conventional two-stage approaches, in which contrastive embedding learning is followed by approximate nearest neighbor search (ANNS), can suffer from misalignment between the two stages, and this mismatch degrades retrieval performance. We propose End-to-end Hierarchical Indexing (EHI), a novel method that addresses this issue directly by jointly optimizing embedding generation and the ANNS structure. EHI uses a dual encoder to embed queries and documents while simultaneously learning an inverted file index (IVF)-style tree structure. To make this discrete structure effectively learnable, EHI introduces dense path embeddings that encode the path traversed by queries and documents within the tree. Extensive evaluations on standard benchmarks, including MS MARCO (Dev set) and TREC DL19, show that EHI outperforms traditional ANNS indices. Under the same computational constraints, EHI outperforms existing state-of-the-art methods by 1.45% in MRR@10 on MS MARCO (Dev) and by 8.2% in nDCG@10 on TREC DL19, highlighting the benefits of our end-to-end approach.
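As a rough illustration of what a dense path embedding could look like, the sketch below routes an embedding vector through a small tree and concatenates the per-level softmax distributions over children. It is a minimal sketch under stated assumptions, not the EHI implementation: the names `make_routers` and `path_embedding` are hypothetical, the routers are random rather than learned, and each level shares a single linear router instead of conditioning on the parent node.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_routers(dim, branching, height, seed=0):
    """Hypothetical routers: one (branching x dim) weight matrix per tree level.

    Assumption: weights are random here; in a trained system they would be
    learned jointly with the encoder.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(branching)]
        for _ in range(height)
    ]


def path_embedding(vec, routers):
    """Return (discrete_path, dense_path_embedding) for one embedding vector.

    The dense embedding concatenates the child distribution at every level,
    so it has length branching * height. The discrete path is the greedy
    (argmax) child choice at each level.
    """
    path, dense = [], []
    for weights in routers:
        scores = [sum(w * x for w, x in zip(row, vec)) for row in weights]
        probs = softmax(scores)
        dense.extend(probs)
        path.append(max(range(len(probs)), key=probs.__getitem__))
    return tuple(path), dense


# Toy query and document embeddings (illustrative values only).
routers = make_routers(dim=4, branching=3, height=2)
query = [0.2, -0.1, 0.7, 0.4]
doc = [0.1, -0.2, 0.8, 0.3]

q_path, q_dense = path_embedding(query, routers)
d_path, d_dense = path_embedding(doc, routers)

# Dot product of dense path embeddings: higher when both vectors are routed
# toward the same children at each level.
similarity = sum(a * b for a, b in zip(q_dense, d_dense))
print(q_path, d_path, round(similarity, 3))
```

Because the dense path embedding is a continuous function of the router scores, a similarity computed on it can in principle be trained with gradient-based objectives, whereas the discrete argmax path alone cannot.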