We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX [6], stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy, which operates on multiple vectors at a time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (40% faster on average) while relying only on scalar code that gets auto-vectorized. We combined the PDX layout with the recent dimension-pruning algorithms ADSampling [19] and BSA [52], which accelerate approximate vector search. We found that, on the horizontal vector layout, these algorithms can lose to SIMD-optimized linear scans even when they are themselves SIMD-optimized. When used on PDX, however, their benefit is restored to 2-7x. We find that search on PDX is especially fast when only a limited number of dimensions has to be scanned fully, which is what dimension-pruning approaches do. Finally, we introduce PDX-BOND, an even more flexible dimension-pruning strategy with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data "as-is" without preprocessing, making it attractive for vector databases with frequent updates.
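To make the vertical layout and the dimension-by-dimension scan concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`to_pdx_block`, `l2_sq_pdx`, `l2_sq_pdx_pruned`), the `threshold` and `step` parameters, and the simple partial-distance pruning rule are all assumptions made for illustration. The pruning shown is a generic one that relies only on the fact that squared L2 partial sums never decrease; it is not ADSampling, BSA, or PDX-BOND. Pure Python also says nothing about performance; the sketch only shows the access pattern that a compiler could auto-vectorize in a lower-level language.

```python
def to_pdx_block(vectors):
    """Transpose a block of row-major vectors into dimension-major order.

    Input:  vectors[i][d]  (horizontal layout, one vector per row)
    Output: block[d][i]    (vertical layout, one dimension per row)
    """
    n_dims = len(vectors[0])
    return [[v[d] for v in vectors] for d in range(n_dims)]


def l2_sq_pdx(block, query):
    """Squared L2 distance from `query` to every vector in the block."""
    n_vectors = len(block[0])
    dists = [0.0] * n_vectors
    for d, column in enumerate(block):   # one dimension at a time
        q = query[d]
        for i in range(n_vectors):       # tight loop over all vectors
            diff = column[i] - q
            dists[i] += diff * diff
    return dists


def l2_sq_pdx_pruned(block, query, threshold, step=2):
    """Same scan, dropping vectors whose partial distance exceeds `threshold`.

    Hypothetical pruning rule for illustration: squared L2 partial sums are
    non-decreasing, so a vector that already exceeds the threshold can never
    come back under it. Survivors are re-checked every `step` dimensions.
    """
    n_vectors = len(block[0])
    dists = [0.0] * n_vectors
    alive = list(range(n_vectors))
    for d, column in enumerate(block):
        q = query[d]
        for i in alive:                  # only surviving vectors are touched
            diff = column[i] - q
            dists[i] += diff * diff
        if (d + 1) % step == 0:
            alive = [i for i in alive if dists[i] <= threshold]
    # Final filter covers dimension counts that are not a multiple of `step`.
    return {i: dists[i] for i in alive if dists[i] <= threshold}


vectors = [[0, 0, 0, 0], [1, 1, 1, 1], [3, 0, 0, 0], [0, 0, 0, 2]]
query = [0, 0, 0, 0]
block = to_pdx_block(vectors)
all_dists = l2_sq_pdx(block, query)
survivors = l2_sq_pdx_pruned(block, query, threshold=4.0)
```

In this toy block, the third vector exceeds the threshold after the first dimension and is dropped at the first check, so the remaining dimensions are scanned only for the three survivors. That shrinking set of candidates is the situation in which a dimension-major layout pays off.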