As artificial intelligence grows in popularity, vectors have become one of the most widely used data structures for services such as information retrieval and recommendation. Approximate Nearest Neighbor Search (ANNS), which generally relies on indices optimized for fast search to organize large datasets, plays a core role in these services. As data shifts become more frequent, it is crucial for indices to accommodate new data and support real-time updates. Existing research follows two approaches, each with drawbacks: 1) approaches that use additional buffers to temporarily store new data are resource-intensive and inefficient because of their global rebuilding processes; 2) approaches that update the internal index structure suffer from performance degradation because of update congestion and imbalanced distribution in streaming workloads. In this paper, we propose UBIS, an Updatable Balanced Index for stable streaming similarity Search, which resolves conflicts by scheduling concurrent updates and maintains good index quality by reducing imbalanced update cases as the update frequency grows. Experimental results on real-world datasets demonstrate that UBIS achieves up to 77% higher search accuracy and 45% higher update throughput on average compared to state-of-the-art indices in streaming workloads.