Recent advances in Information Retrieval have leveraged high-dimensional embedding spaces to improve the retrieval of relevant documents. The Manifold Clustering Hypothesis suggests that, despite these high-dimensional representations, documents relevant to a query reside on a lower-dimensional, query-dependent manifold. While this hypothesis has inspired new retrieval methods, existing approaches still struggle to separate non-relevant information from relevant signals. We propose a novel methodology that addresses this limitation by leveraging information from both relevant and non-relevant documents. Our method, ECLIPSE, computes a centroid from non-relevant documents and uses it as a reference to estimate the noisy dimensions present in relevant ones, thereby enhancing retrieval performance. Extensive experiments on three in-domain benchmarks and one out-of-domain benchmark show average improvements of up to 19.50% in mAP (AP) and 11.42% in nDCG@10 over the DIME-based baseline, and of up to 22.35% and 13.10%, respectively, over the baseline that uses all dimensions. Our results pave the way for more robust, pseudo-irrelevance-based retrieval systems in future IR research.
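To make the centroid idea concrete, the following is a minimal sketch rather than the ECLIPSE implementation. The abstract does not give the scoring rule, so the per-dimension importance used here (the query component times the difference between the pseudo-relevant and pseudo-irrelevant centroids) is an assumption, as are the function names `centroid`, `select_dimensions`, and `masked_score` and the `keep_ratio` parameter. The choice of pseudo-relevant and pseudo-irrelevant documents (for example, the top- and bottom-ranked results of a first-stage retriever) is likewise assumed.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_dimensions(
    query: Vector,
    pseudo_rel: Sequence[Vector],
    pseudo_irrel: Sequence[Vector],
    keep_ratio: float = 0.5,
) -> List[bool]:
    """Return a keep/drop mask over embedding dimensions.

    ASSUMPTION: a dimension counts as informative when the query agrees with
    the pseudo-relevant centroid more than with the pseudo-irrelevant one.
    This rule is illustrative and is not taken from the paper.
    """
    rel_c = centroid(pseudo_rel)
    irrel_c = centroid(pseudo_irrel)
    importance = [q * (r - i) for q, r, i in zip(query, rel_c, irrel_c)]

    # Keep the highest-scoring fraction of dimensions (at least one).
    k = max(1, round(keep_ratio * len(query)))
    ranked = sorted(range(len(query)), key=lambda d: importance[d], reverse=True)
    keep = set(ranked[:k])
    return [d in keep for d in range(len(query))]


def masked_score(query: Vector, doc: Vector, mask: Sequence[bool]) -> float:
    """Dot product restricted to the retained dimensions."""
    return sum(q * x for q, x, m in zip(query, doc, mask) if m)
```

In this sketch, re-ranking amounts to sorting candidate documents by `masked_score` instead of the full dot product. With `keep_ratio=1.0` the mask keeps every dimension, which corresponds to the all-dimensions baseline mentioned in the abstract.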