BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Approximate Nearest Neighbor Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are, in practice, more efficient on large datasets than the other methods proposed in the literature. The growing volume and dimensionality of data necessitate scalable techniques for ANNS. To this end, the prior art has explored parallelizing graph-based ANNS on GPUs, leveraging their high computational power and energy efficiency. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the index graph and the data to reside entirely in GPU memory, or (ii) partition the data into small independent shards, each of which fits in GPU memory, and search these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the second delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe bus. In this paper, we introduce BANG, a first-of-its-kind GPU-based ANNS method that works efficiently on billion-scale datasets that cannot fit entirely in GPU memory. BANG stands out by using compressed data on the GPU to perform distance computations while maintaining the graph on the CPU. BANG incorporates highly optimized GPU kernels and proceeds in stages that run concurrently on the GPU and CPU, taking advantage of their architectural specificities. We evaluate BANG using a single NVIDIA Ampere A100 GPU on ten popular ANN benchmark datasets. BANG outperforms the state-of-the-art in the majority of the cases. Notably, on the billion-size datasets, BANG is significantly faster than the competing methods, achieving 40x-200x higher throughput at a high recall of 0.9.
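The division of labor described above (graph lookups on one side, distance computations over compressed vectors on the other) can be illustrated with a minimal sketch. This is not BANG's implementation: the abstract does not specify the compression scheme, so a simple per-coordinate scalar quantizer stands in for it, and everything here runs sequentially in plain Python rather than concurrently on a CPU and GPU. The names `quantize`, `approx_dist`, and `greedy_search` are hypothetical.

```python
import heapq

def quantize(vec, lo=-1.0, hi=1.0, levels=256):
    """Compress each coordinate to a small integer code (assumed stand-in scheme)."""
    step = (hi - lo) / (levels - 1)
    return [min(levels - 1, max(0, round((x - lo) / step))) for x in vec]

def approx_dist(query, code, lo=-1.0, hi=1.0, levels=256):
    """Squared L2 distance between a full-precision query and a decoded code."""
    step = (hi - lo) / (levels - 1)
    return sum((q - (lo + c * step)) ** 2 for q, c in zip(query, code))

def greedy_search(query, graph, codes, start, k, beam):
    """Best-first graph search ranked only by distances to compressed vectors."""
    visited = {start}
    d0 = approx_dist(query, codes[start])
    frontier = [(d0, start)]   # min-heap of candidates still to expand
    best = [(-d0, start)]      # max-heap (negated) of the `beam` closest nodes seen
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) == beam and d > -best[0][0]:
            break  # the closest unexpanded candidate cannot improve the result
        # "CPU side" in this sketch: fetch unvisited neighbors from the graph.
        neighbors = [n for n in graph[node] if n not in visited]
        visited.update(neighbors)
        # "GPU side" in this sketch: distances computed from compressed codes only.
        for n in neighbors:
            dn = approx_dist(query, codes[n])
            if len(best) < beam or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, n))
                heapq.heappush(best, (-dn, n))
                if len(best) > beam:
                    heapq.heappop(best)
    ranked = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ranked][:k]

# Toy data: 21 points on a line, linked as a chain graph.
points = [[-1.0 + 0.1 * i, 0.0] for i in range(21)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 21] for i in range(21)}
codes = [quantize(p) for p in points]   # only the codes are used during search
print(greedy_search([0.33, 0.0], graph, codes, start=0, k=3, beam=4))
```

In this sketch the full-precision `points` are never consulted during the search, which mirrors the idea of keeping only a compact representation in the memory-constrained device while the adjacency lists live elsewhere.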
