GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search

Approximate nearest neighbor search (ANNS) on GPUs is increasingly popular for modern retrieval and recommendation workloads that operate over massive collections of high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently, yet they often need many probes to reach high recall, which strains memory bandwidth and compute. Aiming to simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirements on GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel that together achieve high throughput at high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier across recall, throughput, index build time, and storage footprint. At a recall of approximately 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA, while also building indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers over 2.7x higher throughput on average, without accessing the raw vectors for reranking.
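The abstract does not spell out how RaBitQ codes are turned into distances inside IVF lists. As a rough illustration only, here is a minimal CPU sketch in NumPy of IVF combined with 1-bit RaBitQ as described in the original RaBitQ paper: each vector's residual to its list centroid is normalized, randomly rotated, and stored as sign bits, and the unit-vector inner product is estimated as <o,q> ≈ <ō,q>/<ō,o>. This is not the cuVS GPU implementation. The function names `build_ivf_rabitq` and `search`, the plain Lloyd k-means coarse quantizer, and all parameters are hypothetical, and the sketch omits query quantization, multi-bit codes, error bounds, and the fused search kernel.

```python
import numpy as np


def random_rotation(dim, rng):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))


def sq_dists(x, centroids):
    # Squared Euclidean distances, shape (len(x), len(centroids)).
    return ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)


def build_ivf_rabitq(data, n_lists, n_iters=10, seed=0):
    """Hypothetical IVF build with 1-bit RaBitQ-style residual codes."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Coarse IVF quantizer: a few Lloyd (k-means) iterations.
    centroids = data[rng.choice(n, n_lists, replace=False)].copy()
    for _ in range(n_iters):
        labels = sq_dists(data, centroids).argmin(axis=1)
        for k in range(n_lists):
            members = data[labels == k]
            if len(members):  # keep the old centroid if a list empties
                centroids[k] = members.mean(axis=0)
    labels = sq_dists(data, centroids).argmin(axis=1)  # final assignment

    # Encode the normalized residual o = (o_r - c) / ||o_r - c||.
    P = random_rotation(dim, rng)
    resid = data - centroids[labels]
    norms = np.linalg.norm(resid, axis=1)               # ||o_r - c||
    unit = resid / np.where(norms > 0, norms, 1.0)[:, None]
    rotated = unit @ P                                  # P^T o (row-vector form)
    bits = rotated > 0                                  # stored D-bit code
    xbar = (2.0 * bits - 1.0) / np.sqrt(dim)            # entries +-1/sqrt(D)
    ip_ob = (xbar * rotated).sum(axis=1)                # <o_bar, o>
    return dict(centroids=centroids, labels=labels, P=P, bits=bits,
                norms=norms, ip_ob=ip_ob)


def search(index, query, k=10, n_probe=4):
    """Rank candidates in the n_probe nearest lists by estimated distance."""
    dim = query.shape[0]
    probe = np.argsort(sq_dists(query[None, :], index["centroids"])[0])[:n_probe]
    cand_ids, cand_est = [], []
    for lst in probe:
        ids = np.flatnonzero(index["labels"] == lst)
        if len(ids) == 0:
            continue
        q_res = query - index["centroids"][lst]
        q_norm = np.linalg.norm(q_res)                  # ||q_r - c||
        q_rot = (q_res / (q_norm if q_norm > 0 else 1.0)) @ index["P"]
        xbar = (2.0 * index["bits"][ids] - 1.0) / np.sqrt(dim)
        ip_ob = index["ip_ob"][ids]
        # RaBitQ estimator: <o, q> ~ <o_bar, q> / <o_bar, o>.
        cos_est = (xbar @ q_rot) / np.where(ip_ob > 0, ip_ob, 1.0)
        o_norm = index["norms"][ids]
        # ||o_r - q_r||^2 = ||o_r-c||^2 + ||q_r-c||^2 - 2 ||o_r-c|| ||q_r-c|| <o, q>
        est = o_norm ** 2 + q_norm ** 2 - 2.0 * o_norm * q_norm * cos_est
        cand_ids.append(ids)
        cand_est.append(est)
    if not cand_ids:
        return np.empty(0, dtype=int), np.empty(0)
    ids, est = np.concatenate(cand_ids), np.concatenate(cand_est)
    order = np.argsort(est)[:k]
    return ids[order], est[order]
```

In this sketch, ranking reads only each vector's bit code, its residual norm, and the precomputed <ō,o>, so no raw vector is touched during search. That mirrors the property the abstract contrasts with IVF-PQ's reranking, although the sketch makes no claim about how the GPU version lays out or scans these codes.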
