Graph-based indexing is the dominant approach for approximate nearest neighbor (ANN) search in vector databases, offering high recall with low latency across billions of vectors. However, in such indices, the edge set of the proximity graph is modified only to reflect changes in the indexed data, never to adapt to the query workload. This is wasteful: real-world query streams exhibit strong spatial and temporal locality, yet every query must re-traverse the same intermediate hops from fixed or random entry points. We present CatapultDB, a lightweight mechanism that, for the first time, determines on the fly where to begin the search in an ANN index, thereby exploiting query locality. CatapultDB injects shortcut edges, called catapults, that connect query regions to frequently visited destination nodes. Catapults are maintained as an additional layer on top of the graph, so the standard vector search algorithm remains unchanged: queries are simply routed to a better starting point when an appropriate catapult exists. This transparent design preserves the full feature set of the underlying system, including filtered search, dynamic insertions, and disk-resident indices. We implement CatapultDB and evaluate it using four workloads with varying amounts of bias. Our experiments show that CatapultDB increases throughput by up to 2.51x compared to DiskANN at equivalent or better recall, matches the efficiency of LSH-based approaches without sacrificing filtering or requiring index reconstruction, and adapts gracefully to workload shifts, unlike cache-based alternatives.
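To make the catapult idea concrete, the following is a minimal sketch of a shortcut layer that sits beside an unchanged graph search and only chooses the entry point. It is not CatapultDB's implementation. The abstract does not say how query regions are defined or when a catapult is created, so the sign-of-random-projection bucketing in `region`, the `min_hits` threshold, and every name here (`CatapultLayer`, `greedy_search`, `search`) are assumptions made for illustration.

```python
import heapq
import random
from collections import Counter, defaultdict


def dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_search(vectors, graph, query, entry, beam=4):
    """Plain best-first beam search over a proximity graph.

    Returns (nearest_node, expanded), where `expanded` counts the nodes whose
    neighbor lists were read. This routine knows nothing about catapults.
    """
    visited = {entry}
    d0 = dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]        # max-heap (negated) of the best `beam` nodes
    expanded = 0
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= beam and d > -best[0][0]:
            break                # closest candidate cannot improve the results
        expanded += 1
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(vectors[nb], query)
            if len(best) < beam or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > beam:
                    heapq.heappop(best)
    nearest = min((-neg_d, n) for neg_d, n in best)[1]
    return nearest, expanded


class CatapultLayer:
    """Hypothetical shortcut layer: query region -> frequently reached node."""

    def __init__(self, dim, n_planes=8, min_hits=2, seed=0):
        rng = random.Random(seed)
        # Assumption: a region is the sign pattern of random projections.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.min_hits = min_hits
        self.hits = defaultdict(Counter)

    def region(self, query):
        return tuple(sum(p * q for p, q in zip(plane, query)) >= 0
                     for plane in self.planes)

    def entry_for(self, query, default_entry):
        """Return a catapult destination if one is established, else the default."""
        counts = self.hits.get(self.region(query))
        if counts:
            node, n = counts.most_common(1)[0]
            if n >= self.min_hits:
                return node
        return default_entry

    def record(self, query, destination):
        """Remember where a query from this region ended up."""
        self.hits[self.region(query)][destination] += 1


def search(layer, vectors, graph, query, default_entry=0):
    """Route the query to a starting point, then run the unchanged search."""
    entry = layer.entry_for(query, default_entry)
    nearest, expanded = greedy_search(vectors, graph, query, entry)
    layer.record(query, nearest)
    return nearest, expanded
```

Because the layer only picks the entry point, `greedy_search` can be swapped for any graph traversal without touching the shortcut logic, which mirrors the transparency argument in the abstract. The sketch omits eviction of stale catapults after a workload shift, filtered search, and disk-resident storage, all of which a real system would need to handle.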