Dense vector retrieval is essential for semantic queries in Natural Language Processing, particularly in knowledge-intensive applications such as Retrieval-Augmented Generation (RAG). Retrieving vectors that are both similar to the query and diverse among themselves substantially enhances system performance. Although the Maximal Marginal Relevance (MMR) algorithm is widely used to balance these two objectives, its reliance on a manually tuned parameter leads to optimization fluctuations and unpredictable retrieval results. Furthermore, the joint optimization of similarity and diversity in vector retrieval lacks sufficient theoretical analysis. To address these challenges, this paper introduces a novel approach that captures both constraints simultaneously by maximizing the similarity between the query vector and the sum of the selected candidate vectors. We formally define this optimization problem, Vectors Retrieval with Similarity and Diversity (VRSD), and prove that it is NP-complete, establishing a rigorous theoretical bound on the inherent difficulty of this dual-objective retrieval. We then present a parameter-free heuristic algorithm to solve VRSD. Extensive evaluations on multiple scientific QA datasets, incorporating both objective geometric metrics and LLM-simulated subjective assessments, demonstrate that our VRSD heuristic consistently outperforms established baselines, including MMR and Determinantal Point Processes (k-DPP).
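The objective described in the abstract, maximizing the similarity between the query vector and the sum of the selected vectors, can be illustrated with a small sketch. The abstract does not specify the similarity measure or the steps of the heuristic, so everything below is an assumption made for illustration: cosine similarity is assumed as the measure, `greedy_select` is a simple hypothetical greedy procedure (not the paper's algorithm), and `exhaustive_select` is a brute-force reference that is only feasible for tiny inputs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sum_vector_score(query, vectors):
    """Similarity between the query and the element-wise sum of the chosen vectors."""
    total = [sum(column) for column in zip(*vectors)]
    return cosine(query, total)


def greedy_select(query, candidates, k):
    """Hypothetical greedy sketch: repeatedly add the candidate that most
    improves the sum-vector score. Not the heuristic from the paper."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: sum_vector_score(
                query, [candidates[j] for j in selected] + [candidates[i]]
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def exhaustive_select(query, candidates, k):
    """Brute-force reference: score every size-k subset (tiny inputs only)."""
    return max(
        combinations(range(len(candidates)), k),
        key=lambda idx: sum_vector_score(query, [candidates[i] for i in idx]),
    )


if __name__ == "__main__":
    query = [1.0, 1.0]
    candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(greedy_select(query, candidates, 2))
    print(exhaustive_select(query, candidates, 2))
```

On this toy input the greedy sketch selects indices 1 and 2, while the exhaustive search selects indices 0 and 2, whose sum points exactly along the query. In both cases the two near-duplicate vectors (indices 0 and 1) are not chosen together, which shows how a single sum-based score can reward diversity without a trade-off parameter. The gap between the greedy and exhaustive results is also consistent with the hardness result stated in the abstract: a cheap heuristic need not find the optimal subset.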