BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are more efficient in practice than the other methods proposed in the literature. The growing volume and dimensionality of data necessitate scalable techniques for ANNS. To this end, prior work has explored parallelizing graph-based ANNS on GPUs to exploit their massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in GPU memory, or (ii) partition the dataset into small independent shards, each of which fits in GPU memory, and search these shards on the GPU. The first approach cannot handle large datasets because of the limited memory available on the GPU, while the second performs poorly on large datasets because of heavy data traffic over the low-bandwidth PCIe bus. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets that cannot fit entirely in GPU memory. BANG stands out by using a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept in host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimized GPU kernels and proceeds in phases that run concurrently on the GPU and CPU. Notably, on billion-size datasets, we achieve 40x-200x higher throughput than the competing methods at a high recall of 0.9. Additionally, BANG is the most cost- and power-efficient among the competing methods from the recent Billion-Scale Approximate Nearest Neighbour Search Challenge.
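The division of labour described above (distances computed from a compressed copy of the data, neighbour lists fetched from a separately stored graph) can be illustrated with a minimal single-threaded Python sketch. It is not BANG's implementation: the abstract does not name the compression scheme, so 8-bit scalar quantization is assumed here as a stand-in, and `compress`, `approx_dist`, `build_knn_graph`, and `greedy_search` are hypothetical names introduced only for illustration. The brute-force k-NN graph is likewise a placeholder for a real graph index.

```python
import heapq


def compress(vectors, lo=0.0, hi=1.0):
    """Quantize each coordinate to one byte (assumed stand-in compression)."""
    scale = 255.0 / (hi - lo)
    return [bytes(min(255, max(0, round((x - lo) * scale))) for x in v)
            for v in vectors]


def approx_dist(code, query, lo=0.0, hi=1.0):
    """Squared L2 distance between the query and a decompressed code."""
    step = (hi - lo) / 255.0
    return sum((lo + c * step - q) ** 2 for c, q in zip(code, query))


def build_knn_graph(vectors, degree):
    """Brute-force k-NN graph; a placeholder for a real graph index."""
    graph = {}
    for i, v in enumerate(vectors):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(v, w)), j)
            for j, w in enumerate(vectors) if j != i
        )
        graph[i] = [j for _, j in dists[:degree]]
    return graph


def greedy_search(graph, codes, query, start, k, list_size):
    """Best-first graph search that only ever touches compressed vectors."""
    d0 = approx_dist(codes[start], query)
    visited = {start}
    frontier = [(d0, start)]   # min-heap of candidates not yet expanded
    best = [(-d0, start)]      # max-heap of the list_size closest nodes seen
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) == list_size and d > -best[0][0]:
            break              # no remaining candidate can improve the list
        for nb in graph[node]:  # neighbour lookup (host-memory side in BANG)
            if nb in visited:
                continue
            visited.add(nb)
            dn = approx_dist(codes[nb], query)  # compressed-data distance
            if len(best) < list_size or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > list_size:
                    heapq.heappop(best)
    ranked = sorted((-negd, n) for negd, n in best)
    return [n for _, n in ranked[:k]]
```

In this sketch `list_size` bounds the candidate list and trades recall for work per query. The two commented lines mark where, in a design like the one the abstract describes, the graph lookup and the distance computation would run on different devices.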
