There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. Yet embedding-based retrieval has been studied for over ten years, and similarity search for more than half a century. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management: the vagueness of semantic similarity, the large size of vectors, the high cost of similarity comparison, the lack of a natural partitioning that can be used for indexing, and the difficulty of efficiently answering hybrid queries that involve both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, as well as randomized, learned, and navigable partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, and hardware-accelerated execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including native systems specialized for vectors and extended systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally we outline research challenges and point to directions for future work.