A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search

In recent years, Approximate Nearest Neighbor Search (ANNS) has played a pivotal role in modern search and recommendation systems, especially in emerging LLM applications such as Retrieval-Augmented Generation. A growing body of work harnesses the parallel computing capabilities of GPUs to meet the substantial computational demands of ANNS. However, existing systems primarily target offline scenarios and overlook the distinct requirement of online applications: real-time insertion of new vectors. This limitation makes such systems inefficient in real-world scenarios. Moreover, previous architectures struggle to support real-time insertion effectively because they rely on serial execution streams. In this paper, we introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS). Our architecture achieves its objectives through three key advancements: 1) We first examine the real-time insertion mechanisms in existing GPU ANNS systems and find that they rely on repeated copying and memory allocation, which significantly hinders real-time performance on GPUs. As a solution, we introduce a dynamic vector insertion algorithm based on memory blocks, which includes in-place rearrangement. 2) To enable real-time vector insertion in parallel, we introduce a multi-stream parallel execution mode, in contrast to existing systems that operate serially within a single stream. Our system uses a dynamic resource pool that allows multiple streams to execute concurrently without additional execution blocking. 3) In extensive experiments and comparisons, our approach effectively handles varying QPS levels across different datasets, reducing latency by 40%-80%. The proposed system has also been deployed in real-world industrial search and recommendation systems, serving hundreds of millions of users daily with good results.
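The contrast between reallocate-and-copy insertion and block-based insertion can be illustrated with a minimal host-side sketch. This is not the RTAMS-GANNS algorithm: the class `BlockVectorStore`, its parameters `dim` and `block_capacity`, and the `allocations` counter are hypothetical names chosen for illustration, plain Python arrays stand in for GPU memory, and the in-place rearrangement step mentioned in the abstract is not modeled. The sketch only shows the general idea that growing storage one fixed-size block at a time avoids copying vectors that are already stored.

```python
from array import array


class BlockVectorStore:
    """Hypothetical append-only vector store that grows in fixed-size blocks."""

    def __init__(self, dim, block_capacity):
        self.dim = dim                        # floats per vector
        self.block_capacity = block_capacity  # vectors per block
        self.blocks = []                      # list of preallocated float32 arrays
        self.size = 0                         # number of vectors stored so far
        self.allocations = 0                  # how many blocks were allocated

    def _allocate_block(self):
        # One fixed-size allocation; existing blocks are never copied or moved.
        self.blocks.append(array("f", [0.0]) * (self.dim * self.block_capacity))
        self.allocations += 1

    def insert(self, vector):
        """Write one vector into the next free slot and return its index."""
        if len(vector) != self.dim:
            raise ValueError("dimension mismatch")
        block_id, slot = divmod(self.size, self.block_capacity)
        if block_id == len(self.blocks):
            # Current blocks are full: add a new block instead of reallocating.
            self._allocate_block()
        start = slot * self.dim
        self.blocks[block_id][start:start + self.dim] = array("f", vector)
        self.size += 1
        return self.size - 1

    def get(self, index):
        """Read a stored vector back as a list of floats."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        block_id, slot = divmod(index, self.block_capacity)
        start = slot * self.dim
        return list(self.blocks[block_id][start:start + self.dim])


store = BlockVectorStore(dim=2, block_capacity=2)
ids = [store.insert([float(i), 0.5]) for i in range(5)]
```

In this sketch, inserting five vectors with a block capacity of two triggers three block allocations and no copies of existing data, whereas a single contiguous buffer would have to be reallocated and copied each time it filled up.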
