Multi-vector similarity search is essential for fine-grained semantic retrieval in many real-world applications, offering richer representations than the traditional single-vector paradigm. Lacking a native multi-vector index, existing methods rely on a filter-and-refine framework built on single-vector indexes. Because these methods treat the token vectors within each multi-vector object in isolation and ignore their correlations, they face an inherent dilemma: aggressive filtering sacrifices recall, while conservative filtering incurs prohibitive computational cost during refinement. To address this limitation, we propose MV-HNSW, the first native hierarchical graph index designed for multi-vector data. MV-HNSW introduces three components: a novel edge-weight function that satisfies the properties essential for graph-based indexing (symmetry, cardinality robustness, and query consistency), an accelerated algorithm for computing multi-vector similarity, and an augmented search strategy that dynamically discovers candidates that are relevant yet topologically disconnected. Extensive experiments on seven real-world datasets show that MV-HNSW achieves state-of-the-art search performance, maintaining over 90% recall while reducing search latency by up to 14.0$\times$ compared to existing methods.
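To make the filter-and-refine dilemma concrete, the following is a minimal sketch and not the MV-HNSW implementation. The abstract does not name a similarity function, so the sketch assumes a MaxSim-style (late-interaction) score: for each query vector, take the best inner product in the object, then sum. The names `maxsim`, `symmetric_weight`, `filter_and_refine`, and `per_token_k` are hypothetical. `symmetric_weight` only illustrates what symmetry and normalisation by cardinality could look like; it is not the paper's edge-weight function. The filter stage uses a brute-force scan as a stand-in for a single-vector index.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
MultiVector = Sequence[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: MultiVector, obj: MultiVector) -> float:
    """Assumed similarity: sum over query vectors of the best inner product in obj."""
    return sum(max(dot(q, v) for v in obj) for q in query)


def symmetric_weight(a: MultiVector, b: MultiVector) -> float:
    """Hypothetical symmetric score: average both directions, each divided
    by the number of vectors on its own side."""
    return 0.5 * (maxsim(a, b) / len(a) + maxsim(b, a) / len(b))


def filter_and_refine(
    query: MultiVector,
    corpus: Dict[str, MultiVector],
    per_token_k: int,
    top_k: int,
) -> List[str]:
    # Filter: each query vector independently retrieves its per_token_k
    # nearest token vectors (brute force here, a single-vector index in practice).
    tokens = [(obj_id, v) for obj_id, vecs in corpus.items() for v in vecs]
    candidates = set()
    for q in query:
        ranked = sorted(tokens, key=lambda t: dot(q, t[1]), reverse=True)
        candidates.update(obj_id for obj_id, _ in ranked[:per_token_k])
    # Refine: rerank the candidate objects with the full multi-vector score;
    # ties are broken by object id so the result is deterministic.
    ranked_objs = sorted(candidates, key=lambda d: (-maxsim(query, corpus[d]), d))
    return ranked_objs[:top_k]


# Toy data: "c" matches both query vectors fairly well but is never the
# single closest token for either one.
corpus = {
    "a": [(1.0, 0.0)],
    "b": [(0.0, 1.0)],
    "c": [(0.9, 0.0), (0.0, 0.9)],
}
query = [(1.0, 0.0), (0.0, 1.0)]

aggressive = filter_and_refine(query, corpus, per_token_k=1, top_k=1)
conservative = filter_and_refine(query, corpus, per_token_k=2, top_k=1)
print(aggressive, conservative)
```

On this toy data, `per_token_k=1` never admits object `c` to the candidate set even though it has the highest full score, whereas `per_token_k=2` recovers it at the cost of scoring every object in the corpus during refinement. This is the recall-versus-cost trade-off described above, at miniature scale.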