Disk-based approximate nearest neighbor search (ANNS) systems that manage large-scale vector datasets face critical efficiency challenges stemming from the co-location of vector data and auxiliary index metadata. Our analysis of state-of-the-art ANNS systems reveals that such co-location incurs substantial storage overhead, generates excessive reads during search queries, and causes severe write amplification during updates. We present DecoupleVS, a decoupled vector storage management framework that enables specialized optimizations for vector data and auxiliary index metadata. DecoupleVS incorporates design techniques for compression, data layout, search queries, and updates that significantly reduce storage space while maintaining high search and update performance and high search accuracy. Evaluation on real-world public and proprietary billion-scale datasets shows that DecoupleVS reduces storage space by up to 58.7% while delivering competitive or improved search and update performance compared to state-of-the-art monolithic disk-based ANNS systems.