Approximate nearest neighbor search (ANNS) is a core problem in machine learning and information retrieval applications. GPUs offer a promising path to high-performance ANNS: they provide massive parallelism for distance computations, are readily available, and can be co-located with downstream applications. Despite these advantages, current GPU-accelerated ANNS systems face three key limitations. First, real-world applications operate on evolving datasets that require fast batch updates, yet most GPU indices must be rebuilt from scratch when new data arrives. Second, high-dimensional vectors strain memory bandwidth, but current GPU systems lack efficient quantization techniques that reduce data movement without introducing costly random memory accesses. Third, the data-dependent memory accesses inherent to greedy search make overlapping computation with memory traffic difficult, limiting performance. We present Jasper, a GPU-native ANNS system that delivers both high query throughput and updatability. Jasper builds on the Vamana graph index and overcomes these bottlenecks through three contributions: (1) a CUDA batch-parallel construction algorithm that enables lock-free streaming insertions, (2) a GPU-efficient implementation of RaBitQ quantization that reduces memory footprint by up to 8x without incurring random-access penalties, and (3) an optimized greedy search kernel that increases compute utilization, yielding better latency hiding and higher throughput. Our evaluation across five datasets shows that Jasper achieves up to 1.93x higher query throughput than CAGRA and reaches up to 80% of peak utilization as measured by the roofline model. Jasper's construction scales efficiently, building indices an average of 2.4x faster than CAGRA while providing the updatability that CAGRA lacks. Compared to BANG, the previously fastest GPU Vamana implementation, Jasper delivers 19-131x faster queries.