VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models

Vector retrieval algorithms are essential for semantic queries in the rapidly evolving landscape of Large Language Models (LLMs). Retrieving vectors that satisfy both similarity and diversity criteria substantially enhances the capabilities of LLM-based agents. Maximal Marginal Relevance (MMR) is widely used in retrieval scenarios that require both relevance and diversity, but its results fluctuate with the parameter $\lambda$, which makes the optimization trajectory in vector space hard to determine and obscures the direction of improvement. Moreover, the similarity and diversity constraints in retrieval lack a robust theoretical analysis. This paper introduces a novel approach that characterizes both constraints through the relationship between the sum vector of the selected vectors and the query vector. The proximity of the sum vector to the query vector captures the similarity constraint, while requiring the individual vectors that make up the sum vector to align with the query vector from divergent directions captures the diversity constraint. We also formulate a new combinatorial optimization problem: select $k$ vectors from a set of candidates such that their sum vector maximally aligns with the query vector. We show that this problem is NP-complete, which establishes the profound difficulty of pursuing similarity and diversity simultaneously in vector retrieval and lays a theoretical groundwork for further research. Additionally, we present a heuristic algorithm, Vectors Retrieval with Similarity and Diversity (VRSD), which has a definite optimization goal, requires no preset parameters, and offers a modest reduction in time complexity compared to MMR. Empirical validation further confirms that VRSD significantly surpasses MMR across various datasets.
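
The sum-vector objective described above can be illustrated with a small sketch. This is not the paper's implementation: the function names (`greedy_sum_vector_select`, `exhaustive_select`, `top_k_by_similarity`), the use of cosine similarity as the alignment measure, and the greedy rule (start from an empty selection and repeatedly add the candidate whose inclusion best aligns the running sum with the query) are all assumptions made for illustration. The exhaustive search is included only to show the combinatorial problem on a toy input; it enumerates every size-$k$ subset and is impractical beyond small candidate sets.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def vector_sum(vectors, dim):
    """Component-wise sum of a list of vectors of dimension dim."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total


def top_k_by_similarity(query, candidates, k):
    """Baseline: the k candidates most similar to the query, ignoring diversity."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(candidates[i], query), reverse=True)
    return order[:k]


def greedy_sum_vector_select(query, candidates, k):
    """Assumed greedy heuristic: at each step add the candidate that makes
    the running sum vector most aligned with the query."""
    selected = []
    total = [0.0] * len(query)
    remaining = list(range(len(candidates)))
    for _ in range(min(k, len(candidates))):
        best = max(remaining,
                   key=lambda i: cosine(vector_sum([total, candidates[i]],
                                                   len(query)), query))
        selected.append(best)
        remaining.remove(best)
        total = vector_sum([total, candidates[best]], len(query))
    return selected


def exhaustive_select(query, candidates, k):
    """Exact solution by enumerating all size-k subsets (toy inputs only)."""
    return list(max(combinations(range(len(candidates)), k),
                    key=lambda idx: cosine(
                        vector_sum([candidates[i] for i in idx], len(query)),
                        query)))


# Toy data (made up for illustration): a 2-D query and four candidates.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.2], [0.0, 1.0]]

print(top_k_by_similarity(query, candidates, 2))       # [0, 1]
print(greedy_sum_vector_select(query, candidates, 2))  # [0, 2]
print(exhaustive_select(query, candidates, 2))         # [1, 2]
```

On this toy input, plain top-$k$ similarity picks two nearly identical vectors that both lean to the same side of the query. The greedy rule instead pairs the most similar vector with one that leans the opposite way, so the sum points closer to the query. The exhaustive search finds a slightly better pair still, which illustrates that a greedy procedure of this kind is a heuristic and need not reach the optimum.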
