Vector Retrieval with Similarity and Diversity: How Hard Is It?

Dense vector retrieval is an important building block of modern machine learning systems, underlying applications ranging from semantic search to retrieval-augmented generation and knowledge-intensive reasoning. Beyond retrieving items that are individually similar to a query, many applications require a result set that is also diverse, complementary, and collectively informative. Balancing similarity and diversity is therefore central to effective retrieval, but it remains difficult to optimize in a stable and theoretically grounded way. Maximal Marginal Relevance (MMR) is a widely adopted heuristic for this problem, yet its reliance on a manually tuned parameter leads to optimization fluctuations and unpredictable retrieval results. More broadly, existing methods provide limited theoretical insight into how similarity and diversity interact in dense vector spaces, leaving the joint optimization problem insufficiently understood. To address these challenges, this paper introduces an approach that captures both objectives simultaneously by maximizing the similarity between the query vector and the sum of the selected candidate vectors. We formally define this optimization problem, Vector Retrieval with Similarity and Diversity (VRSD), and prove that it is NP-complete, which establishes the inherent difficulty of this dual-objective retrieval task. We then present a parameter-free heuristic algorithm for solving VRSD. Extensive evaluations on multiple datasets, using both objective geometric metrics and LLM-simulated subjective assessments, demonstrate that our VRSD heuristic consistently outperforms established baselines, including MMR and Determinantal Point Processes (k-DPP).
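
To make the sum-vector objective concrete, the following is a minimal sketch. It assumes cosine similarity as the similarity measure and uses a simple greedy selection loop. The abstract specifies neither, so the function names (cosine, sum_vector_score, greedy_select) and the greedy strategy are illustrative assumptions and not the heuristic proposed in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def sum_vector_score(query, candidates, selected):
    """Similarity between the query and the sum of the selected candidate vectors."""
    total = [sum(candidates[i][d] for i in selected) for d in range(len(query))]
    return cosine(query, total)

def greedy_select(query, candidates, k):
    """Illustrative greedy loop (an assumption, not the paper's algorithm):
    repeatedly add the candidate that most improves the sum-vector similarity."""
    selected = []
    total = [0.0] * len(query)
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: cosine(query, [t + c for t, c in zip(total, candidates[i])]),
        )
        selected.append(best)
        total = [t + c for t, c in zip(total, candidates[best])]
        remaining.remove(best)
    return selected

# Hypothetical toy data: candidates 0 and 1 are near-duplicates, while candidate 2
# is less similar to the query on its own but complements candidate 0.
query = [1.0, 0.0]
candidates = [[0.8, 0.6], [0.79, 0.61], [0.6, -0.8]]
print(greedy_select(query, candidates, 2))  # [0, 2]
```

In this toy example, ranking by individual similarity alone would return the two near-duplicates (indices 0 and 1). The sum-vector objective instead prefers indices 0 and 2, because their sum points almost exactly along the query direction. This suggests how a single objective can reward relevance and complementarity together without a trade-off parameter.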
