Approximate Nearest Neighbor Search (ANNS) on high-dimensional vectors is now widely used in applications ranging from information retrieval and question answering to recommendation. As the amount of vector data grows continuously, it becomes important to support updates to the vector index, the enabling technique for efficient and accurate ANNS. Because of the curse of dimensionality, identifying the right neighbors of a single new vector, a necessary step in an index update, is often costly. To amortize update costs, existing systems maintain a secondary index that accumulates updates, which are merged into the main index by periodically rebuilding the entire index globally. However, this approach causes large fluctuations in search latency and accuracy, not to mention that rebuilds require substantial resources and are extremely time-consuming. We introduce SPFresh, a system that supports in-place vector updates. At the heart of SPFresh is LIRE, a lightweight incremental rebalancing protocol that splits vector partitions and reassigns vectors in nearby partitions to adapt to shifts in data distribution. LIRE achieves low-overhead vector updates by reassigning only the vectors at the boundary between partitions; in a high-quality vector index, the number of such vectors is expected to be small. With LIRE, SPFresh provides better query latency and accuracy than solutions based on global rebuild, while needing only 1% of the DRAM and less than 10% of the cores at peak compared with the state of the art, on a billion-scale vector index with a daily vector update rate of 1%.
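The split-and-reassign idea behind LIRE can be illustrated with a toy, in-memory sketch. This is not SPFresh's implementation: the class `ToyIndex`, the helpers `centroid`, `two_means`, and `build_demo_index`, and the parameters `max_size` and `nearby` are all hypothetical names chosen for illustration. The sketch assumes Euclidean distance, a simple 2-means split, and centroids that stay fixed between splits, and it ignores cascading splits triggered by reassignment.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))


def two_means(vectors, iters=10, seed=0):
    """Split vectors into two non-empty groups with a few Lloyd iterations."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(vectors, 2)
    for _ in range(iters):
        g0 = [v for v in vectors if math.dist(v, c0) <= math.dist(v, c1)]
        g1 = [v for v in vectors if math.dist(v, c0) > math.dist(v, c1)]
        if not g0 or not g1:
            # Degenerate case (e.g. duplicate points): fall back to halving.
            half = len(vectors) // 2
            return vectors[:half], vectors[half:]
        c0, c1 = centroid(g0), centroid(g1)
    return g0, g1


class ToyIndex:
    """Hypothetical partitioned index: each partition has a fixed centroid."""

    def __init__(self, max_size=8, nearby=2):
        self.partitions = {}      # partition id -> list of vectors
        self.centroids = {}       # partition id -> centroid
        self.next_id = 0
        self.max_size = max_size  # split threshold (assumed)
        self.nearby = nearby      # neighbor partitions checked after a split
        self.reassigned = 0       # how many vectors were moved in total

    def _new_partition(self, vectors):
        pid = self.next_id
        self.next_id += 1
        self.partitions[pid] = list(vectors)
        self.centroids[pid] = centroid(vectors)
        return pid

    def _nearest(self, v):
        return min(self.centroids, key=lambda p: math.dist(v, self.centroids[p]))

    def insert(self, v):
        """In-place update: append to the nearest partition, split if too big."""
        if not self.partitions:
            self._new_partition([v])
            return
        pid = self._nearest(v)
        self.partitions[pid].append(v)
        if len(self.partitions[pid]) > self.max_size:
            self._split(pid)

    def _split(self, pid):
        vectors = self.partitions.pop(pid)
        old_c = self.centroids.pop(pid)
        g0, g1 = two_means(vectors)
        new_ids = [self._new_partition(g0), self._new_partition(g1)]
        # Only the two new partitions and a few nearby ones are re-examined.
        others = [p for p in self.centroids if p not in new_ids]
        others.sort(key=lambda p: math.dist(old_c, self.centroids[p]))
        for p in new_ids + others[: self.nearby]:
            for v in list(self.partitions[p]):
                best = self._nearest(v)
                if best != p:  # v now sits on the wrong side of a boundary
                    self.partitions[p].remove(v)
                    self.partitions[best].append(v)
                    self.reassigned += 1


def build_demo_index(n=200, seed=1):
    """Insert n 2-D points drawn around two centers into a fresh ToyIndex."""
    rng = random.Random(seed)
    index = ToyIndex()
    for i in range(n):
        cx, cy = (0.0, 0.0) if i % 2 == 0 else (10.0, 10.0)
        index.insert((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
    return index
```

Calling `build_demo_index()` and comparing `index.reassigned` with the total number of inserted vectors gives a rough feel for how much work the local reassignment does relative to touching every vector, though the ratio in this toy setting says nothing about SPFresh's measured results.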