GPU-Resident Inverted File Index for Streaming Vector Databases

Vector search has emerged as the computational backbone of modern AI infrastructure, powering systems ranging from vector databases to Retrieval-Augmented Generation (RAG). The GPU-accelerated Inverted File (IVF) index is one of the most widely used techniques for these large-scale workloads because of its memory efficiency, but its traditional architecture remains fundamentally static. Existing designs rely on rigid, contiguous memory layouts that lack native support for in-place mutation, creating a severe bottleneck in streaming scenarios. In applications requiring real-time knowledge updates, such as live recommendation engines or dynamic RAG systems, maintaining index freshness requires expensive CPU-GPU round trips that cause system latency to spike from milliseconds to seconds. In this paper, we propose SIVF (Streaming Inverted File), a new GPU-native architecture designed to give vector databases high-velocity data ingestion and deletion capabilities. SIVF replaces the static memory layout with a slab-based allocation system and a validity bitmap, enabling lock-free, in-place mutation directly in VRAM. We further introduce a GPU-resident address translation table (ATT) to eliminate the overhead of locating vectors, providing $O(1)$ access to physical storage slots. We evaluate SIVF against the industry-standard GPU IVF implementation on the SIFT1M and GIST1M datasets. Microbenchmarks demonstrate that SIVF reduces deletion latency by up to $13{,}300\times$ (from 11.8 seconds to 0.89 ms on GIST1M) and improves ingestion throughput by $36\times$ to $105\times$. In end-to-end sliding-window scenarios, SIVF eliminates system freezes and achieves a $161\times$ to $266\times$ speedup with single-digit millisecond latency. Notably, this performance incurs a negligible storage penalty, with less than 0.8\% memory overhead compared to static indices.
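
To make the three components named above concrete, the following is a minimal, sequential CPU-side sketch in Python of how slab-based storage, a validity bitmap, and an address translation table could fit together. It is an illustration under stated assumptions, not the SIVF implementation: the real system is GPU-resident and lock-free, whereas this toy is single-threaded, and every name here (`ToyStreamingIVF`, `SLAB_SIZE`, `sq_dist`) is hypothetical.

```python
SLAB_SIZE = 4  # slots per slab; a tiny, hypothetical value for illustration


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyStreamingIVF:
    """CPU-side toy of a slab + validity-bitmap + address-table layout."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.slab_vecs = []   # slab -> SLAB_SIZE vector slots
        self.slab_ids = []    # slab -> SLAB_SIZE vector ids
        self.valid = []       # slab -> SLAB_SIZE validity flags (the "bitmap")
        self.lists = [[] for _ in self.centroids]  # inverted list -> slab indices
        self.att = {}         # vector id -> (slab, slot); constant-time lookup

    def _nearest_lists(self, vec, n):
        """Indices of the n inverted lists whose centroids are closest to vec."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(self.centroids[i], vec))
        return order[:n]

    def insert(self, vec_id, vec):
        if vec_id in self.att:          # assumption: re-insertion means update
            self.delete(vec_id)
        lst = self._nearest_lists(vec, 1)[0]
        # Reuse a free slot in one of this list's slabs if any exists ...
        for slab in self.lists[lst]:
            if False in self.valid[slab]:
                slot = self.valid[slab].index(False)
                break
        else:
            # ... otherwise chain a fresh slab onto the list; nothing is copied.
            slab = len(self.slab_vecs)
            self.slab_vecs.append([None] * SLAB_SIZE)
            self.slab_ids.append([None] * SLAB_SIZE)
            self.valid.append([False] * SLAB_SIZE)
            self.lists[lst].append(slab)
            slot = 0
        self.slab_vecs[slab][slot] = list(vec)
        self.slab_ids[slab][slot] = vec_id
        self.valid[slab][slot] = True
        self.att[vec_id] = (slab, slot)

    def delete(self, vec_id):
        slab, slot = self.att.pop(vec_id)  # locate the slot without scanning
        self.valid[slab][slot] = False     # clear one flag; no data is moved

    def search(self, query, k=1, nprobe=1):
        hits = []
        for lst in self._nearest_lists(query, nprobe):
            for slab in self.lists[lst]:
                for slot in range(SLAB_SIZE):
                    if self.valid[slab][slot]:   # skip deleted or empty slots
                        d = sq_dist(self.slab_vecs[slab][slot], query)
                        hits.append((d, self.slab_ids[slab][slot]))
        hits.sort()
        return [vec_id for _, vec_id in hits[:k]]


index = ToyStreamingIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.insert(1, [0.1, 0.2])
index.insert(2, [0.3, 0.1])
index.insert(3, [9.8, 10.1])
index.delete(1)                 # frees slot 0 of the first slab
index.insert(4, [0.0, 0.1])     # reuses that slot in place
print(index.search([0.0, 0.0], k=2))  # prints [4, 2]
```

In this sketch a deletion touches only the address table and one validity flag, so its cost does not depend on the list length. That is the property the abstract attributes to in-place mutation, in contrast to rebuilding a contiguous list. A GPU version would presumably replace the Python dictionary and flag lists with device-side arrays updated through atomic operations. That is an assumption about the design, not something the abstract specifies.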
