Vector databases have become a cornerstone of modern information retrieval, powering applications in recommendation, search, and retrieval-augmented generation (RAG) pipelines. However, scaling approximate nearest neighbor (ANN) search to high recall under strict latency SLOs remains fundamentally constrained by memory capacity and I/O bandwidth. Disk-based vector search systems suffer severe latency degradation at high accuracy, while fully in-memory solutions incur prohibitive memory costs at billion scale. Despite the central role of caching in traditional databases, vector search lacks a general query-level caching layer capable of amortizing repeated query work. We present QVCache, the first backend-agnostic, query-level caching system for ANN search with a bounded memory footprint. QVCache exploits semantic query repetition by performing similarity-aware caching rather than exact-match lookup. It dynamically learns region-specific distance thresholds with an online learning algorithm, enabling recall-preserving cache hits while bounding lookup latency and memory usage independently of dataset size. QVCache operates as a drop-in layer for existing vector databases: it maintains a megabyte-scale memory footprint, achieves sub-millisecond cache-hit latency, and reduces end-to-end query latency by one to three orders of magnitude (40-1000x) when integrated with existing ANN systems. For workloads exhibiting temporal-semantic locality, QVCache substantially reduces latency while preserving recall comparable to the underlying ANN backend, establishing it as a missing but essential caching layer for scalable vector search.
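The core idea of similarity-aware caching with online-learned, region-specific thresholds can be sketched as follows. This is a minimal illustration under assumed semantics, not QVCache's actual implementation: the class name, threshold-update rule, and FIFO eviction are all hypothetical stand-ins for the paper's algorithm.

```python
import numpy as np

class SimilarityCache:
    """Sketch of a similarity-aware query cache (hypothetical design).

    A lookup hits when the incoming query embedding falls within the
    learned distance threshold of a previously cached query, rather
    than requiring an exact match.
    """

    def __init__(self, capacity=1024, init_threshold=0.1, lr=0.05):
        self.capacity = capacity          # bounded memory footprint
        self.init_threshold = init_threshold
        self.lr = lr                      # online threshold-adjustment rate
        self.entries = []                 # (query_vec, results, threshold)

    def lookup(self, q):
        """Return cached results of the nearest cached query within its
        region-specific threshold, or None on a miss."""
        best_idx, best_dist = None, float("inf")
        for i, (v, _, thr) in enumerate(self.entries):
            d = float(np.linalg.norm(q - v))
            if d < thr and d < best_dist:
                best_idx, best_dist = i, d
        return self.entries[best_idx][1] if best_idx is not None else None

    def insert(self, q, results):
        """Cache the results of a query that missed."""
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)           # FIFO eviction, for illustration only
        self.entries.append((np.asarray(q, dtype=float),
                             results, self.init_threshold))

    def feedback(self, idx, recall_ok):
        """Online threshold update (illustrative rule): widen a region whose
        cached answers preserved recall, shrink one whose hit hurt recall."""
        v, r, thr = self.entries[idx]
        thr *= (1 + self.lr) if recall_ok else (1 - self.lr)
        self.entries[idx] = (v, r, thr)
```

On a hit the cache returns stored results in a single pass over a megabyte-scale table, bypassing the ANN backend entirely; on a miss the backend is queried and its results inserted. The threshold-feedback loop is what lets hit regions grow only where doing so does not degrade recall.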